13. (Cancelled). 

14. (Cancelled). 

15. (Cancelled). 

16. (Amended) The method of claim 2, wherein said repeat sequences are postulated based 
upon amino acid sequences. 

17. (Cancelled). 

Claims 10-15 and 17 have been canceled. Claims 5-7 and 16 have been amended. Claims 
2, 3, 5-9, 16, 18-33, and 39 remain in the case. 

§ 112 Rejections 

Claim 17 is rejected by the Examiner under 35 U.S.C. 112, first paragraph, as failing to 
comply with the written description requirement. Claim 17 has been canceled, and Applicant 
requests that this basis for rejection be removed from the case. 

Claims 5-16 are rejected by the Examiner under 35 U.S.C. 112, second paragraph, as 
being indefinite for failing to particularly point out and distinctly claim the subject matter which 
applicant regards as the invention. Specifically, the Examiner states that claims 5-16 are 
indefinite for recitation of the phrase "said sequences" because it is not clear to which of the 
sequences in the claims from which claims 5-16 depend the phrase refers. Claims 5-16 have 
been amended to address the Examiner's comments. Specifically, each claim that requires it 
(claims 5-7 and 16) has been amended to specifically reference either the "repeat sequences" or 
the "query sequence." These amendments fully address the Examiner's basis of rejection, and 
Applicant requests that this basis be removed from the case. 

§ 103 Rejections 

The Examiner rejects claims 2, 3, 5, 7, 8, 18-20, 27 and 30 under 35 U.S.C. 103(a) as 
being unpatentable over Jurka et al. (1996). 

The Examiner takes the position that it would have been obvious to a person of ordinary 
skill in the art at the time the invention was made to modify the method of Jurka et al. (1996) by 
addition of newly determined repeat sequences to a repeat sequence database so that the repeat 
sequence database would be a more comprehensive listing of repeat sequences. 

The Examiner also rejects claims 2, 6, 15, 16, 19-24, 26-29, and 31-33 under 35 U.S.C. 
103(a) as being unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, 
and 30, and further in view of Altschul et al. The Examiner argues that it would have been 
obvious to a person of ordinary skill in the art at the time the invention was made to modify the 
method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 by use of 
analysis of ribonucleotide sequences, sequences that encode amino acid sequences, repeat 
sequence databases accessible through the internet, use of public domain databases GenBank, 
dbEST, and SwissProt, use of search algorithms BLAST and FASTA, and use of scoring 
matrices PAM and BLOSUM because Altschul et al. shows use of all of those features in the 




Attorney Docket No.: LXGN-00104 PATENT 

context of searching sequence databases with query sequences whose repeat sequences have 
been masked. 



The Examiner also rejects claims 2 and 7-14 under 35 U.S.C. 103(a) as being 
unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, 
and further in view of Jurka (1998). The Examiner takes the position that it would have been 
obvious to a person of ordinary skill in the art at the time the invention was made to modify the 
method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 by use of repeat 
sequences from a variety of organisms so that corresponding query sequences from the 
organisms could be analyzed and masked. 

Claims 2, 22, and 25 are rejected by the Examiner under 35 U.S.C. 103(a) as being 
unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, 
and further in view of Sohocki et al. According to the Examiner's reasoning, it would have been 
obvious to a person of ordinary skill in the art at the time the invention was made to modify the 
method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above by use of 
TIGR Human Gene Index database because Sohocki et al. shows that the database is a useful 
source of human genes such as genes related to inherited retinal disorders. 

Applicant's Response to § 103 Rejections 

First, claim 39, which remains in the case and has not been rejected on any § 103 basis, 
would appear to be allowable. Applicant requests that the case be allowed at a minimum 
with claim 39 surviving. 

The remainder of the claims stand rejected under § 103 as noted above, chiefly over Jurka 
(1996), alone or in combination with Altschul, Jurka (1998), and Sohocki. 

The cited chief reference of Jurka (1996) fails to teach at least one of the key inventive 
points of the present invention. This failure in teaching is not cured by or obvious over any of the 
secondary references cited. All of the art cited deals with taking an "unknown" sequence and 
querying it against a known sequence database to see where it fits in the broader sequence 
picture (a typical search against a database of "known" things and then categorize and report the 
results). In doing so, all of the cited art teaches away from the present invention as noted below. 

In contrast to Jurka (1996) and the cited secondary art, the present invention teaches the 
computer how to deal with hordes of random snippets of DNA sequence information, assemble 
them into contigs, and during the process "learn" how to identify and "mask" novel repetitive 
elements (which otherwise greatly confuse the assembly), and then reassemble the data via an 
iterative "learning" process that identifies new repeats and "remembers" to delete them from the 
subsequent assembly (by adding them to the "known" repeat/masking database). 

In fact, the type of searching and processing described in Jurka (1996) and the secondary 
cited art is clearly intended to be performed BEFORE the presently claimed invention/program is 
applied (i.e., each of the sequences is scanned against a "known" repeat database and all known 
sequences are masked prior to the sequence being used in the assembly). In addition, as directly 




18069302 



Pergamon 

0097-8485(95)00023-2 

Computers Chem. Vol. 20, No. 1, pp. 119-121, 1996 
Copyright © 1996 Elsevier Science Ltd 
Printed in Great Britain. All rights reserved 
0097-8485/96 $15.00+0.00 



SOFTWARE NOTE 

CENSOR— A PROGRAM FOR IDENTIFICATION AND 
ELIMINATION OF REPETITIVE ELEMENTS FROM 
DNA SEQUENCES 

JERZY JURKA,* PAUL KLONOWSKI, VADIM DAGMAN and 

PAUL PELTON 

Linus Pauling Institute of Science and Medicine, 440 Page Mill Road, Palo Alto, CA 94306, U.S.A. 

(Received 19 October 1994; in revised form 10 February 1995) 

Abstract: CENSOR is a program designed to identify and eliminate fragments of DNA sequences 
homologous to any chosen reference sequences, in particular to repetitive elements. CENSOR is based on 
two principal algorithms of Smith & Waterman (1981) [J. Mol. Biol. 147, 195] and Wilbur & Lipman (1983) 
[Proc. Natl Acad. Sci. U.S.A. 80, 726]. It includes several pre-set sensitivity levels based on both biological 
and statistical criteria which help to distinguish between aligned pairs of homologous and non-homologous 
sequences. CENSOR has been implemented in C/C++ in the SUN/UNIX environment. 



INTRODUCTION 

Repetitive sequences are very abundant in eukaryotic 
genomes and, inevitably, almost every researcher 
working on newly sequenced eukaryotic DNA must 
deal with them. Usually, repetitive sequences are 
eliminated prior to GenBank/EMBL database 
searches. However, repeats are increasingly more 
often annotated and studied in the sequence context as 
integral components of the genetic material. 
Annotation and basic studies of repetitive DNA at the 
sequence level require specialized databases and 
computer software. A preliminary reference collection 
of human repeats and on-line software for 
identification of repeats based on the minimum length 
encoding method (PYTHIA) have been published 
before (Jurka et al., 1992). The reference collection 
continues to be updated and released electronically via 
the National Center for Biotechnology Information 
(NCBI repository). The identification of repeats by 
PYTHIA is based on the alignment of a sequence 
under investigation against the reference collection of 
human repetitive elements without automatic 
elimination of the identified repeats in the analyzed 
sequence. Furthermore, alignment based on the minimum 
length encoding method is relatively CPU-intensive, 
which imposes significant limitations on its 
widespread usage. This prompted us to develop a new, 
more efficient and user-friendly program, called 
"CENSOR", based on recently described principles 
for identification and analysis of repetitive DNA 
(Jurka, 1994). Related software has recently been 
described by other authors (Claverie & States, 1993; 
Altschul et al., 1994; Claverie, 1994; Quentin & 
Finchant, 1994). This welcome development is likely 
to improve the quality of sequence data analysis in 
coming years. 

PROGRAM DESCRIPTION 

The basic steps implemented in CENSOR involve 
rapid comparison and alignment of reference 
sequences with a sequence under study, followed by 
replacement of homologous fragments by asterisks in 
the studied target sequence. The latter procedure is 
called 'censoring' (Jurka, 1994), and was first applied 
in studies of medium reiteration frequency (MER) 
repeats (Jurka, 1990). The CENSOR front-end 
interface permits running DASHER3 (Faulkner, 1987) 
for fast sequence comparisons (Wilbur & Lipman, 
1983). Following the fast search is the crucial step of 
LOCAL alignment (Smith & Waterman, 1981) and 
subsequent evaluation and elimination of homologous 
sequences. 
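
The censoring step described above can be sketched in a few lines. This is an illustration only, not the CENSOR source code: the function name `censor`, its parameters, and the short test sequence are hypothetical, and fragment boundaries are assumed to be 1-based and inclusive, as the "FRAGMENT 30 -> 93" notation in the program's output suggests.

```python
def censor(sequence: str, fragments: list[tuple[int, int]], mask_char: str = "*") -> str:
    """Replace each 1-based, inclusive (start, end) fragment with mask_char."""
    masked = list(sequence)
    for start, end in fragments:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"fragment {start}->{end} lies outside the sequence")
        for i in range(start - 1, end):  # convert to 0-based, end-exclusive indexing
            masked[i] = mask_char
    return "".join(masked)


# Mask positions 5..10 of a hypothetical 16-base sequence.
print(censor("TTTTCATACTCCCAGG", [(5, 10)]))  # prints TTTT******CCCAGG
```

Keeping the masked sequence the same length as the input preserves the coordinates of the remaining bases, which is what allows a censored file to be rerun against the reference collection.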

The censoring procedure has recently been 
implemented in XBLAST under the name of 'masking' 
(Claverie & States, 1993; Altschul et al., 1994; 
Claverie, 1994). Overall, CENSOR appears to be 
slower than XBLAST, but it is recommended over 
XBLAST whenever older and more diverse repeats are 
being searched. Furthermore, CENSOR uses the ratio 
of mismatches to transitions (see Jurka, 1994), in 
combination with alignment and similarity scores, to 
distinguish true homology from accidental similarity 
between sequences. 

To start the program, one simply types 'censor' at 
the prompt sign. The main menu allows the user to 
choose and set various options for running CENSOR. 
The user can choose to run DASHER3, or proceed 
directly with LOCAL alignment to assure maximum 
sensitivity. The menu also provides a choice of using 
any one of the pre-set sensitivity options for sequence 
comparison or changing any number of individual 
parameters as the user sees fit. These parameters 
are defined under the help menu option, along 
with additional instructions for running CENSOR. 
The remaining options in the main menu start the 
actual censoring process or restart an accidentally 
interrupted run. 



As indicated above, the user must supply two input 
files to run CENSOR. One of them is a reference file 
and the other the studied target sequence. For 
identification and elimination of repetitive DNA 
one should use repetitive elements as a reference 
file. A reference collection of human repeats has 
been described before (Jurka et al., 1992) and its 
expanded and updated version is available from the 
NCBI repository. Reference collections of repetitive 



Sequence alignment (local.out) 

Output file format: 

LOCUS1 N1 N2 LOCUS2 M1 M2 F1 F2 F3 L S # 

... aligned fragments ... 

... statistics line from original file ... 

where N1, N2, M1, M2 - aligned fragments boundaries, 
F1 - (no. of Matches)/(no. of Matches + no. of Mismatches + no. of Gaps), 
F2 - (Number of Gaps)/(Number of Mismatches), 
     which is set to 0 if (Number of Mismatches) = 0, 
F3 - (Number of Mismatches)/(Number of Transitions), 
     which is set to 1 if (Number of Transitions) = 0 
     and to 100 if both numbers = 0, 
L - Length of the top sequence fragment, 
S - Local Score. 

Local parameters: 

Margin - 150 
Similarity threshold - 30.00 
Ratio threshold - 2.00 

ALU@1 75 138 XYZ 30 93 0.94 0.00 4.00 64 55.40 # 

AAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGC 
** **********************************|** ****************** **** 
AATTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCATCGCTACTAAAAATACAAAAAATAGC 

& Containing 59 matches, 0 gaps and 4 mismatches including 1 transitions 

Censored output (asap.out) 

;ID XYZ 
;DE DNA SEQUENCE 
XYZ 
TTTTCATACTCCCAGGCAGGGACGTTCCT***************************************** 
*********************** X ACTAGC 1 

Eliminated sequences (plc.out) 

;LOCUS XYZ 
;DE DNA SEQUENCE 
;ALGNLOCUS XYZ ALU@1 
;FRAGMENT 30 -> 93 
XYZ 
AATTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCATCGCTACTAAAAATACAAAAAATAGC1 

Fig. 1. An example of output files from CENSOR. 
sequences for other species are also available through 
the NCBI server and will be described in detail 
elsewhere. The sequences in the input files must be in 
the IG/Stanford format as previously described 
(Faulkner & Jurka, 1988). To distinguish between 
direct and complementary sequences, loci names 
should end with '@1' and '@2'. This labeling permits 
the option to automatically reverse and complement 
sequences before they are stored in 'plc.out'. 
Two formatting programs are distributed with 
CENSOR. 
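
As a rough illustration of the reverse-and-complement option mentioned above, the following sketch derives a complementary record from a direct one. It is not the distributed formatting software: the function names are hypothetical, and the assumption that '@1' marks the direct strand and '@2' the complementary strand is inferred from the order in which the text lists them.

```python
# Translation table for the four DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def complementary_record(locus: str, sequence: str) -> tuple[str, str]:
    """Build the '@2' record from an '@1' record (labeling convention assumed)."""
    if not locus.endswith("@1"):
        raise ValueError("expected a locus name ending in '@1'")
    return locus[:-2] + "@2", reverse_complement(sequence)


print(complementary_record("ALU@1", "AAGTTC"))  # prints ('ALU@2', 'GAACTT')
```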

Our analysis indicates that, depending on the specific 
case, DASHER3 may be more or less sensitive than other 
programs for fast search (Pearson & Lipman, 1988). 
Therefore, the user should use the direct LOCAL 
alignment option to verify the original output. This 
option is recommended only for files pre-censored 
using the fast search. Using the approximate list of 
matches from fast search, the reference sequences are 
subsequently aligned with sequences under study 
using the LOCAL algorithm (Smith & Waterman, 
1981) and the homologous fragments are censored out. 

CENSOR generates three final output files outlined 
in Fig. 1. The alignment results are stored in 
'local.out'. The fragments homologous to the 
reference sequences are cut out and stored in 'plc.out'. 
The censored sequences are written to 'asap.out' with 
asterisks in place of repeats. One can choose to use 
other ASCII characters in place of asterisks. The file 
'asap.out' can be renamed and rerun against the 
reference collection under different conditions for 
possible identification and censoring of more distant 
repeats. However, one should remember that 
non-homologous sequence fragments will increasingly 
be censored out as one moves towards higher 
sensitivity levels. There are five pre-set sensitivity levels 
which contain built-in parameters for identification of 
similarities and for distinguishing true homologies 
from accidental similarities. Sensitivity of CENSOR 
can be adjusted by changing the 'window' size, as 
well as cutoff thresholds for similarity scores in 
DASHER3 and alignment scores in LOCAL. The 
lower the scores, the more distant the similarities 
reported by CENSOR. As indicated above, one can 
skip fast search by DASHER3 and proceed directly 
with local alignment. To reduce false positives, the 
similarity is evaluated based on the biological fact that 
transitions, i.e. A <-> G and C <-> T mutations, 
are relatively more common than transversions. 
CENSOR uses the ratio of transversions to transitions 
(Jurka, 1994) or, equivalently, the ratio of mismatches 
to transitions in the aligned pair of sequences, where 
mismatches represent the sum of transitions and 
transversions between the aligned sequences. The 
expected ratio of mismatches to transitions, referred to 
as 'ratio', for a random match is 3:1. For the majority 
of searches a 'conservative option' (No. 3) is 
adequate. It includes low cut-off thresholds for 
DASHER3 (4.5), a low ratio of mismatches to 
transitions (2:1) and a relatively high LOCAL 
alignment score (30.0). If the LOCAL score exceeds 35.0, 
all alignments are reported by CENSOR 
irrespective of the ratio of mismatches to transitions. 
The next level of sensitivity (No. 4) sets the LOCAL score 
at 22.0 and the ratio at 2.8:1. Beginning with this level 
one has to evaluate the alignments using criteria other 
than those implemented in CENSOR. 
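
The ratio test described above can be sketched as follows, using the aligned pair shown in Fig. 1. This is a minimal illustration under stated assumptions, not the program's implementation: the function names are hypothetical, the alignment is taken to be gap-free, the comparison operators at the thresholds (>= 30.0, <= 2.0, > 35.0) are guesses at how the 'conservative option' is applied, and the handling of zero counts is a simplification of the special cases listed in Fig. 1.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Aligned pair from Fig. 1 (top: reference fragment, bottom: studied sequence).
TOP = "AAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGC"
BOTTOM = "AATTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCATCGCTACTAAAAATACAAAAAATAGC"


def is_transition(a: str, b: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def mismatch_stats(top: str, bottom: str) -> tuple[int, int]:
    """Count (mismatches, transitions) in a gap-free aligned pair."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    pairs = list(zip(top, bottom))
    mismatches = sum(1 for a, b in pairs if a != b)
    transitions = sum(1 for a, b in pairs if is_transition(a, b))
    return mismatches, transitions


def is_reported(score: float, mismatches: int, transitions: int,
                min_score: float = 30.0, max_ratio: float = 2.0,
                override_score: float = 35.0) -> bool:
    """Apply the option No. 3 thresholds (operators and zero handling assumed)."""
    if score > override_score:
        return True  # reported irrespective of the ratio
    if mismatches == 0:
        ratio = 0.0  # assumption: a perfect match always passes the ratio test
    elif transitions == 0:
        ratio = float("inf")  # assumption: only transversions, so the test fails
    else:
        ratio = mismatches / transitions
    return score >= min_score and ratio <= max_ratio


mismatches, transitions = mismatch_stats(TOP, BOTTOM)
print(mismatches, transitions, is_reported(55.40, mismatches, transitions))
```

For the Fig. 1 pair the ratio is 4.00, above the 2.00 threshold, yet the alignment is still reported because its local score of 55.40 exceeds 35.0.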

Program availability 

CENSOR is currently available via the National 
Center for Biotechnology Information ftp server 
(ncbi.nlm.nih.gov). Login as 'anonymous' and use 
your email address when asked for the password. The 
software package is deposited in the 'repbase/censor' 
directory. In addition to CENSOR, two pieces of 
formatting software are included. The first one, 
'embl2ig', converts sequence files from EMBL format 
to IG/Stanford format and the second, 'compseq', 
generates complementary sequences with properly 
labeled loci names. The formatting programs are 
menu-driven and ask for input and output files. All 
software has been implemented in C/C++ under the 
SUN/UNIX environment. 
DE-FG03-91ER61152. 

Sequence similarity search programs are versatile tools for the molecular biologist, 
frequently able to identify possible DNA coding regions and to provide clues to gene and 
protein structure and function. While much attention has been paid to the precise 
algorithms these programs employ and to their relative speeds, there is a constellation 
of associated issues that are equally important to realize the full potential of these 
methods. Here, we consider a number of these issues, including the choice of scoring 
systems, the statistical significance of alignments, the masking of uninformative or 
potentially confounding sequence regions, the nature and extent of sequence 
redundancy in the databases and network access to similarity search services. 

The advent of rapid DNA sequencing technology in the 
mid-1970s led to an information explosion that continues 
unabated today. Molecular sequence data have become 
the common currency of biomedical research and often 
provide unexpected links among diverse biological 
systems. These connections accelerate research progress 
and may even open up entirely new fields of inquiry. One 
approach to discovering such connections, database 
"homology" searching, has been executed countless times, 
often with surprising results, and has become an essential 
method for the molecular biologist. While the particular 
algorithm used is of course important, the effectiveness of 
database searches is dependent as well on a large number 
of correlative factors, many of which tend to be overlooked 
or dealt with in an inefficient or ad hoc manner. These 
include the following: 

Scoring systems. Most database search algorithms rank 
alignments by a score, whose calculation is dependent 
upon a particular scoring system. Usually there is a default 
system, but it may not be ideal for a user's particular 
problem. For example, haemoglobin subunits used to be 
regarded as "typical" proteins and are often still used as 
benchmark query sequences for evaluating new database 
search techniques and scoring systems. However, today it 
is more common to encounter much larger and more 
complex sequences (see below), and methods developed 
and optimized for small, uniformly-conserved, 
single-domain proteins are inadequate. Scores that are best for 
detecting similarities between greatly diverged sequences 
differ from those best for detecting short but nearly 
identical segments. Optimal strategies for detecting 
similarities between DNA protein-coding regions differ 
from those for non-coding regions. Special scoring 
systems for detecting frame-shift errors in the databases 
have recently been described. A database search program 
should therefore make a variety of scoring systems available 
and users should be aware of which ones are best suited to 
their problems. 

Alignment statistics. Given a query sequence, most 
database search programs will produce an ordered list of 
imperfectly matching database similarities, but none of 
them need have any biological significance. An important 
question is how strong a similarity is necessary to be 
considered surprising. United by a common theory, a 
number of analytic and empirical results are now 
available for assessing database search results. However, 
one still sees occasional extravagant claims in the literature, 
usually springing either from misapplication of the normal 
distribution or from an absence of critical statistical 
analysis. 

Databases. The use of an up-to-date sequence database is 
clearly a vital element of any similarity search. Sequence 
relationships critical to important discoveries have on 
occasion been missed because old or incomplete databases 
were employed. However, the variety of databases available, 
and their overlapping coverage, has the potential to render 
similarity searching cumbersome and inefficient. This no 
longer need be the case. Timely access to complete and 
"nonredundant" sequence databases has become relatively 
simple and inexpensive. 

Database redundancy and sequence repetitiveness. 
Surprisingly strong biases exist in protein and nucleic acid 
sequences and sequence databases. Many of these reflect 
fundamental mosaic sequence properties that are of 
considerable biological interest in themselves, such as 
segments of low compositional complexity or short-period 

Table 1 The BLAST family of programs 

Program(a): BLASTP 
  Query sequence: protein 
  Database sequences: protein 
  Comments: 
  • Default scoring matrix(b) is BLOSUM62; change with command line option 
    "M=PAM250", for example. 
  • Low-complexity masking with "-filter" option; choice of either the SEG or XNU 
    algorithms 

Program: BLASTN 
  Query sequence: nucleotide (both strands) 
  Database sequences: nucleotide 
  Comments: 
  • Parameters optimized for speed, not sensitivity; not intended for finding 
    distantly-related coding sequences 
  • Automatically checks complementary strand of query 

Program: BLASTX 
  Query sequence: nucleotide (six-frame translation) 
  Database sequences: protein 
  Comments: 
  • Very useful for preliminary data containing potential frameshift errors 
  • Nine different genetic codes available(c); change with command line "C=1" 
    (vertebrate mitochondrial), for example 
  • Low-complexity filter option as for BLASTP 

Program: TBLASTN 
  Query sequence: protein 
  Database sequences: nucleotide (six-frame translations) 
  Comments: 
  • Essential for searching protein queries against dbEST 
  • Often useful for finding undocumented open reading frames or frameshift 
    errors in database sequences 
  • Same genetic code options as for BLASTX 

(a) These programs are available through the BLAST Network and e-mail servers (see 
text) and the source codes are available by anonymous ftp on ncbi.nlm.nih.gov. 
(b) More than 65 different PAM, BLOSUM and other scoring matrices are 
available. PAM120 or BLOSUM62 are best for general purposes but a useful 
combination for detecting strong and short to long and weak similarities consists of 
PAM30, PAM120 and PAM250 (ref. 2). 
(c) Default genetic code (C=0) is "standard" or "universal" code. Other codes available 
include: 1, Vertebrate mitochondrial; 2, Yeast mitochondrial; 3, Mold mitochondrial 
and mycoplasma; 4, Invertebrate mitochondrial; 5, Ciliate macronuclear; 6, Protozoan 
mitochondrial; 7, Plant mitochondrial; and 8, Echinodermate mitochondrial. 



repeats. Databases also contain some very large families of 
related domains, motifs or repeated sequences, in some 
cases with hundreds of members. In other cases there has 
been a historical bias in the molecules that have been 
chosen for sequencing. In practice, unless special measures 
are taken, these biases very commonly confound database 
search methods and interfere with the discovery of 
interesting new sequence similarities. Problems include 
the occurrence of misleading, spuriously-high scores, 
ambiguities in the phase of sequence alignments and 
overwhelmingly large output lists in which interesting 
results may be inconspicuously buried. We shall describe 
some recently developed methods that largely solve these 
problems by automatically detecting and masking 
potentially confounding subsequences. 

Failure to deal properly with the factors described above 
can result in chance similarities being claimed significant, 
or biologically important relationships being overlooked. 
Here, we shall discuss these and several other issues in 
database searching. While we will frequently use the BLAST 
programs (Table 1) as examples, most of the questions 
considered have quite general relevance. 

Algorithms and programs 

The earliest sequence comparison studies focussed on the 
alignment of complete sequences. However, with the 
recognition that proteins frequently share only isolated 
regions of similarity, corresponding for instance to 
structural motifs or active sites, attention shifted to 
algorithms for local alignment. Essentially all database 
search methods have been based upon measures of local 
sequence similarity. 

In general, local alignments are assessed by means of a 
score, which is computed as the sum of scores for aligned 
pairs of residues and scores for gaps. How these scores 
are chosen, and what they signify, is discussed below. The 
time necessary to find alignments that optimize such 
scores is sufficiently great that, for most practical purposes, 
either parallel architecture machines or heuristic 
methods such as Fasta are required. The problem may 
be simplified by forbidding gaps. This leads to faster 
heuristic methods such as the BLAST algorithms (Table 1) 
[illegible]. 
While some sensitivity to weak similarities may be lost by 
eschewing gaps, easier generalization and rigorous 
statistical results become available. Alternatively, local 
alignments may be assessed in a more sophisticated manner 
than by the simple sum of substitution and gap scores. 
This may lead to more sensitive detection of weak 
similarities, but at the price of greatly increased 
computation time. 

In general, the relevant considerations in choosing a 
particular algorithm are hardware requirements, speed 
and sensitivity to biological relationships. The tensions 
between these competing claims are resolved variously by 
programs such as Fasta, BLAST and Blaze. The relative 
merits of these and the other programs have been discussed 
at length elsewhere. The idea of optimizing a measure 
of local similarity is common to virtually all popular 
programs, and the results they produce therefore do not 
differ in any truly essential way. 
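
To make the notion of a local alignment score concrete, the following sketch computes the best local score as a sum of pair scores and gap scores by dynamic programming, in the style of the Smith-Waterman recurrence. It is an illustration, not the code of any program named above: the function name and the scoring values (match +1, mismatch -1, gap -2 per position) are arbitrary assumptions. Passing an infinitely negative gap score forbids gaps, which corresponds to the simplification discussed in the text.

```python
def local_alignment_score(a: str, b: str, match: float = 1.0,
                          mismatch: float = -1.0, gap: float = -2.0) -> float:
    """Best local alignment score with a linear gap cost (scores are assumptions)."""
    previous = [0.0] * (len(b) + 1)
    best = 0.0
    for char_a in a:
        current = [0.0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            up = previous[j] + gap        # gap in b
            left = current[j - 1] + gap   # gap in a
            cell = max(0.0, diagonal, up, left)  # 0 lets an alignment start anywhere
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# One inserted base separates two identical four-base blocks.
with_gaps = local_alignment_score("ACGTACGT", "ACGTTACGT")
without_gaps = local_alignment_score("ACGTACGT", "ACGTTACGT", gap=float("-inf"))
print(with_gaps, without_gaps)  # prints 6.0 5.0
```

In this toy case the gapped alignment joins both blocks for a higher score, while the gap-free search finds only the longest single matching segment.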

Local alignment statistics 

Not all biologically important sequence relationships will 
be detected by sequence similarity search programs and, 
even when found, they may be lost among irrelevant or 
chance similarities. While experiment is the ultimate 
arbiter of biological significance, mathematical analysis 
can indicate which similarities are unlikely to have arisen 
by chance and therefore merit special attention. Thus an 
important question concerning alignments produced by 
any database search is whether they can be considered 
statistically significant. 

Fig. 1 The probability density function of the extreme value 
distribution with characteristic value u=0 and decay 
constant λ=1. 


One approach sometimes taken is to record an optimal 
local alignment score for each database sequence and 
then to report these scores as standard deviations from 
the mean. There are several serious and frequently 
unrecognized pitfalls to this procedure. First, the optimal 
scores for the comparison of a query sequence to different 
database sequences cannot be assumed to be drawn from 
the same distribution. The longer a given database 
sequence, the greater the score expected by chance. Also, 
variation in residue composition among sequences can 
yield different score distributions. Second, unless a 
rigorous optimization algorithm is employed, the true 
optimal pairwise scores will be systematically 
underestimated and the shape of the true distribution 
will be ill-determined. Third, comparing a query sequence 
to a set of uniform length random sequences yields scores 
that obey not a normal but an extreme value distribution 
(Box 1 and Fig. 1). The tail of this distribution decays 
exponentially in x rather than x^2, so assuming normality 



tends grossly to exaggerate an alignment's significance. 
Finally, a database search involves many essentially 
independent trials. If the database contains 50,000 
sequences, a score with probability 10^ of arising from a 
single comparison is only marginally significant in the 
context of the complete search. The last two points alone 
imply that an alignment may easily achieve a score over 
ten standard deviations from the mean yet fail to be 
statistically significant. 
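
The last two points can be checked numerically. The sketch below is illustrative only: it assumes the standard form of the extreme value (Gumbel) distribution with characteristic value u and decay constant lambda, treats the database comparisons as independent trials, and uses hypothetical function names. For u=0 and lambda=1 the distribution has mean equal to the Euler-Mascheroni constant and standard deviation pi/sqrt(6), which are used to place a score ten standard deviations above the mean.

```python
import math

EULER_GAMMA = 0.5772156649015329  # mean of the distribution for u=0, lambda=1


def evd_pdf(x: float, u: float = 0.0, lam: float = 1.0) -> float:
    """Probability density of the extreme value distribution."""
    z = lam * (x - u)
    return lam * math.exp(-z - math.exp(-z))


def evd_tail(x: float, u: float = 0.0, lam: float = 1.0) -> float:
    """P(score >= x) under the extreme value distribution."""
    return -math.expm1(-math.exp(-lam * (x - u)))


def normal_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def search_p_value(p_single: float, n_comparisons: int) -> float:
    """Probability of at least one such score in n independent comparisons."""
    return -math.expm1(n_comparisons * math.log1p(-p_single))


sd = math.pi / math.sqrt(6.0)
score = EULER_GAMMA + 10.0 * sd  # ten standard deviations above the mean
p_evd = evd_tail(score)
p_normal = normal_tail(10.0)
p_search = search_p_value(p_evd, 50_000)
print(f"EVD tail {p_evd:.3g}, normal tail {p_normal:.3g}, whole search {p_search:.3g}")
```

Under these assumptions the single-comparison probability is roughly 1.5e-6, many orders of magnitude larger than the normal approximation suggests, and across 50,000 comparisons it rises to about 0.07, which would not usually be called significant.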

Box 1 discusses the extreme value distribution and how 
it may be used to calculate the probability that a gap-free 
local alignment with a given score would arise from the 
comparison of two random sequences. It also describes 
how to modify this probability to account for the "multiple 
tests" of a database search. Such a search can itself generate 
data which provide an alternative to the analytic method 
(Box 1) for estimating alignment statistical significance. 
For a given query, one records the best alignment score to 
each database sequence. If score S is observed f(S) times, 
then plotting log f(S) versus S tends to produce a straight 
line; extrapolation of this line can yield estimates of 
statistical significance. 
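
The empirical procedure just described amounts to fitting a straight line to log f(S) against S and extrapolating it. A minimal sketch follows, assuming an ordinary least-squares fit and using synthetic counts; the function names and the data are hypothetical, and a real analysis would need to exclude scores from sequences related to the query.

```python
import math


def fit_log_frequency(counts: dict[float, float]) -> tuple[float, float]:
    """Least-squares line through (S, log f(S)); returns (slope, intercept)."""
    points = [(s, math.log(f)) for s, f in counts.items() if f > 0]
    if len(points) < 2:
        raise ValueError("need at least two scores with non-zero counts")
    n = len(points)
    mean_s = sum(s for s, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((s - mean_s) ** 2 for s, _ in points)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_s


def expected_count(score: float, slope: float, intercept: float) -> float:
    """Extrapolated number of database sequences expected at a given score."""
    return math.exp(intercept + slope * score)


# Synthetic counts that fall exactly on the line log f(S) = 10 - 0.5 S.
counts = {float(s): math.exp(10.0 - 0.5 * s) for s in range(20, 40)}
slope, intercept = fit_log_frequency(counts)
print(slope, intercept, expected_count(60.0, slope, intercept))
```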

One advantage of this approach is that it is applicable 
to cases for which no rigorous theory is available, such as 
scores from gapped alignments. Thus heuristic programs 
such as Fasta or parallel implementations of the 
Smith-Waterman algorithm such as Blaze or Blitz can 
estimate statistical significance using this method. 
Furthermore, because the scores generated derive from 
comparisons of real sequences, no "random protein" 
model is needed. A disadvantage of the method is the 
need to generate optimal alignment scores for a substantial 
fraction of database sequences in order to calculate 
statistical significance. Potential inaccuracy arises from 
variation in database sequence size and composition, 
which implies that each data point is really drawn from 
a separate distribution. Also, if many sequences 
related to the query are present (see discussion on database 
redundancy below), it may be difficult to base the plotted 
line upon only unrelated sequences. An alternative "curve 
fitting" approach is to estimate the parameters of the 
implicit extreme value distribution for the scoring system 
at hand. In one form or another, curve fitting will 
generally be necessary to calculate the statistical 
significance of scores derived from gapped alignments or 
other complex scoring systems. 

The most important "failure" of the local alignment 
statistics discussed here is on comparisons of regions with 
restricted or unusual amino acid or nucleotide 
composition. Such regions are quite common in proteins, 
but are clearly not well described by the same random 
model used for other sequence regions (see below). Because 
an alignment of such "low complexity" regions has little 
real meaning, it is best simply to note their existence, but 
exclude them from alignments produced in database 
searches (see Figs 2 and 3 for examples). 

Scoring matrices and gap costs 

Many different amino acid substitution score matrices
have been proposed over the years for use with sequence
comparison and database search programs, and a
variety of rationales have been used for their construction.
However, it is possible to show that in the context of
seeking high-scoring segment pairs without gaps, any
such matrix has an implicit amino acid pair frequency
distribution that characterizes the alignments it is
optimized for finding. More precisely, let $p_i$ be the
frequency with which amino acid i occurs in protein
sequences and, within the class of alignments sought, let
$q_{ij}$ be the frequency with which amino acids i and j are
aligned. Then the scores that best distinguish these
alignments from chance are given by the formula

$$ s_{ij} = \log \frac{q_{ij}}{p_i p_j} $$

The base of the logarithm is arbitrary, affecting only the
scale of the scores. Any set of scores useful for local
alignment can be written in this form, so a choice of
substitution matrix can be viewed as an implicit choice of
"target frequencies" $q_{ij}$ (refs 1,6).
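
The log-odds relationship between target frequencies and scores can be illustrated directly. The sketch below is only an illustration of the formula: the function name, the dictionary layout and the default base 2 are assumptions made here, and the frequencies in any real matrix would come from an evolutionary model or from observed alignments.

```python
import math


def log_odds_scores(target_freqs, background_freqs, base=2.0):
    """Compute s_ij = log(q_ij / (p_i * p_j)) for every residue pair.

    target_freqs:     dict mapping (i, j) -> q_ij, the frequency with which
                      residues i and j are aligned in the alignments sought.
    background_freqs: dict mapping i -> p_i, the frequency of residue i.
    base:             logarithm base; it only rescales the scores.
    """
    scores = {}
    for (i, j), q_ij in target_freqs.items():
        chance = background_freqs[i] * background_freqs[j]
        scores[(i, j)] = math.log(q_ij / chance, base)
    return scores
```

Pairs aligned more often than chance predicts receive positive scores and pairs aligned less often receive negative scores, which is what allows a local alignment to end where similarity ends.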

The target frequencies characterizing alignments of
closely related sequences clearly differ from those for
alignments of sequences that are greatly diverged.
Therefore a single matrix cannot be optimal for
recognizing relationships at all evolutionary distances.
It has been argued that for most practical purposes, three
separate matrices should be adequate for locating all
alignments containing sufficient information to rise above
background noise. The question remains how best to
estimate the appropriate corresponding target frequencies.

Estimating the frequencies with which the various
amino acids tend to mutate into one another is a
necessarily empirical problem. The first approach to the
question was taken by Dayhoff and coworkers. Their
"PAM" model of molecular evolution allowed target
frequencies and the corresponding score matrices to be



Fig. 2 Significant sequence matches of the human MTG8 product: the effect of low-complexity masking. MTG8 (ref. 84) is the translated
product of a chromosome 8 gene involved in a t(8;21) translocation that results in an AML1-MTG8 fusion transcript in a case of acute
myeloid leukaemia (GenBank accession number D14820). a, Automated segmentation of low-complexity sequences in MTG8 at relatively
high stringency. To be defined as low-complexity in this run of the SEG algorithm (Box 2), a sequence region must contain at least one
12-residue window with complexity (K, Box 2) less than 0.315. SEG then finds the minimally probable (lowest P0, Box 2) low-complexity
subsequence, of any length, within the overlapping windows of this region. The sequence segments read from left to right and their order in
the polypeptide runs from top to bottom, as shown by the central column of residue numbers. b, The strong match, which emerges clearly
without masking (Poisson p-value 2.5 x 10-*), between sections of MTG8 and Drosophila melanogaster transcription factor TFIID 110-kDa
subunit. c, MTG8 filtered as in (a) but with the low-complexity segments masked by "x" characters, for use as a query sequence in
database searches. d, The significant match between a region of MTG8 containing a cysteine cluster and rat apoptosis protein RP-8. RP-8
(ref. 87) is a gene expressed early in the process of programmed cell death (apoptosis) following glucocorticoid induction in rat thymocytes
(GenBank accession number MSOeOI). This match had a Poisson p-value of 0.0036 for a BLASTP search of the NCBI non-redundant
database of 13th September 1993. *, Identical amino acids; I, Conserved Cys or His residues. Also shown is a sample of the class of
zinc-fingers that occur in the DNA binding domain of the steroid receptor family, indicating a suggestive similarity (which is not statistically
significant by pairwise alignment statistics and would require experimental confirmation) in the positions of most of the Cys or His residues.

Before low-complexity filtering, MTG8 generated an output list from the NCBI non-redundant database of greater than 400 Kbytes
containing 599 database sequences scoring above the BLASTP default threshold. The significant match to apoptosis protein was an
inconspicuous 62nd in this list and scored much lower than many spurious low-complexity matches. After masking of MTG8 as in (b), this
match was 6th in a list of 83 sequences. The latter list contained many matches to a "medium complexity" region of MTG8 which is
tentatively predicted to be alpha helical coiled coil (residues 416-476). Further filtering with SEG at lower stringency (K < 0.365 for a
14-residue window) effectively masked this region, and resulted in a BLASTP output list of only 9 sequences, in which the apoptosis protein
was ranked in score only below the MTG8 self-matches and the match to TFIID 110-kDa subunit.

Genetics volume 6 february 1994 123 



Box 2 SEG: optimal segmentation of low-complexity sequence regions at variable stringency.

matrices are perhaps nearly optimal
for this more general case. Gapped
alignments present the additional
problem of choosing appropriate gap
costs. The simplest algorithms
require these costs to be a linear
function of gap length, but
efficient algorithms for more general
gap costs are also available. Because
no theory exists, appropriate gap costs
have generally been chosen by trial
and error, although there have been
some recent efforts to give this
problem a sounder empirical
footing.

The user of database search
programs should recognize that the
default substitution scores and, where
applicable, gap costs, have generally
been chosen to be appropriate for the
most frequent sort of query. These
scores may not, however, be optimal
for a specific problem. In particular,
matrices such as PAM-120 or
BLOSUM-62 (the current BLASTP
default) are tailored for alignments
of moderately diverged sequences.
Very strong but short similarities, or
very long but weak ones may easily be
missed by these matrices. A fully-
functional database search system
should therefore provide a range of
scoring systems to its users, so that
the algorithm can be adapted to the
problem at hand.



calculated for any desired amount of evolutionary change.
The details of the PAM model have been criticized, and
the vast increase in available sequence data has prompted
recalculation of the model's parameters. Scores for
DNA sequence comparison based on a PAM-like
mutational model have also been described. A different
approach to estimating appropriate target frequencies
relies not on fitting an evolutionary model, but rather on
the direct observation of relatively distant, but
nevertheless presumed largely correct, sequence
alignments. A variety of empirical tests have been
claimed to support the superiority of the resulting
"BLOSUM" matrices for detecting sequence
homology. Lacking an evolutionary model, however,
this approach is less adaptable to generating matrices
tailored to specific applications.

The theory linking substitution matrices with target
frequencies is rigorously established only for local
alignments lacking gaps. Therefore the development above
is generally valid only for the BLAST and related
algorithms. A more general theory for alignments
with gaps [...] have the same broad
Databases and access
The most important requirement for
database searching is a
comprehensive, up-to-date database.
Full releases of GenBank® now occur
every two months, and daily updates
are available for downloading or direct searching by
e-mail and network services. GenBank has undergone a
major expansion in data coverage and now includes, in
addition to nucleotide sequences, data from the major
protein sequence and protein structure databases, as well
as data from U.S. and European patents. Approximately
36% of the records in GenBank are produced by the
international collaborators, EMBL Data Library and the
DNA Database of Japan (DDBJ), with whom database
updates are exchanged daily. Copies of the databases are
available at many sites worldwide.

GenBank (release 80.0) contains 164 megabases of
sequence and is doubling in size every 21 months (D.
Benson, personal communication). This rate can only
increase as a result of genome projects and automated
sequencing technology. As mentioned above, special
purpose computers have a role in maintaining reasonable
search performance in the wake of this data deluge, but
considerable improvements in search efficiency can be
obtained by considering the nature of the data itself.

Many sequence databases have a large degree of
"redundancy" for historical reasons related to
technology and research [...], and also due to the
existence of clusters of closely-related sequences from
multigene families. Also, equivalent gene products have
frequently been sequenced in a number of different species
or organisms. In release 36.0 of PIR International, for
example, there were 65: members of the globin
superfamily, 349 cytochromes c, 583 sequences with
immunoglobulin domains and 274 protein kinases.
Considering only perfectly matching sequences, among
the 52,257 protein sequences in this database, there are
over 3,900 duplicate entries and over 3,800 perfect
substrings of longer entries that together comprise about
10% of the total amino acid residues. Among nucleic acid
sequences there are thousands of Alu variants in GenBank.
And the problem of redundancy is only getting worse: as
a result of projects designed to sample expressed genes
rapidly, tens of thousands of sequence fragments are
being added to the databases; many of these sequences
represent small pieces of known genes. Due to the error-
prone nature of these sequence fragments, identifying
redundancy in these collections is a more difficult task.

As well as decreasing the speed of database searches,
redundancy can obscure novel matches in the output, by
yielding slews of similar or identical alignments. Practically,
there are two simple ways to avoid this problem: i)
construct a smaller "nonredundant" database; ii)
preprocess the query sequence for the presence of known
domains and mask these prior to searching. (The concept
of query masking is discussed in the next section.)

NCBI maintains two quasi-nonredundant sequence
collections (NRDB), one for proteins and one for nucleic
acids. For example, the protein NRDB is constructed
iteratively starting with SWISS-PROT, which is the
smallest and least redundant of the major protein
databases. All of the proteins in PIR International are
compared to those in SWISS-PROT, and identical
sequences are excluded from the former while maintaining
pointers to relevant annotation. Next, all of the protein
translations from GenBank coding sequences ("GenPept")
are compared to the merged SWISS-PROT plus PIR.
Likewise, protein sequences from the Brookhaven
structure database (PDB) and other sources are
incorporated into NRDB. (The OWL nonredundant
sequence database is constructed from the same sources.)
This simple procedure reduces the size of the combined
databases by 50%, yet ensures that all sequences are
represented. More sophisticated methods for creating
derived, composite views of protein and DNA sequence
data promise even further reductions.
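
The iterative merging described above can be sketched as follows. This is a simplified illustration, not NCBI's actual implementation: it treats two entries as redundant only when their sequences are identical, it does not remove perfect substrings of longer entries, and the function name and data layout are hypothetical.

```python
def merge_nonredundant(sources):
    """Merge sequence databases in priority order, dropping exact duplicates.

    sources: ordered list of (database_name, entries), where entries is a
             list of (entry_id, sequence) pairs. The first database is the
             starting point (SWISS-PROT in the procedure described above).

    Returns a dict mapping each distinct sequence to its primary entry and
    to pointers to the duplicate entries that were excluded, so that their
    annotation remains reachable.
    """
    merged = {}
    for database_name, entries in sources:
        for entry_id, sequence in entries:
            key = sequence.upper()
            if key in merged:
                # Identical sequence already present: keep only a pointer.
                merged[key]["pointers"].append((database_name, entry_id))
            else:
                merged[key] = {
                    "primary": (database_name, entry_id),
                    "pointers": [],
                }
    return merged
```

Because the databases are processed in a fixed order, the entry retained for each sequence always comes from the earliest source that contains it.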

Another key issue is access to the databases. Researchers
may perform database similarity searches remotely by
sending their queries, via electronic mail, to centralized
"server" computers, where large and frequently updated
databases are maintained, and where fast processors and
sophisticated software are available. E-mail services of
this sort have been available from various sources for
several years. For example, NCBI provides the BLAST
e-mail server (for more information, send a "help" message
to the Internet address blast@ncbi.nlm.nih.gov), and
EMBL provides Blitz (nethelp@embl-heidelberg.de).
Additional sites and services are given in ref. 64. In addition
to database search and retrieval services, such sites maintain
repositories of public domain software and specialized
datasets that may be accessed via "anonymous ftp" over
the Internet. The existence of high-performance networks
is also giving rise to a new generation of "client-server
applications" that make possible direct, real-time user
interactions with remote servers. NCBI's BLAST network
service and Entrez retrieval system are two examples. For
users of the many excellent commercial software packages
for sequence analysis, we would anticipate the development
of network client-server capabilities in the near future.

Masking of low-complexity sequences 
Interspersed local regions of very simple amino acid
composition are surprisingly abundant in protein
sequences. Some of these regions are homopolymers or
short-period repeats, but most are not periodic and appear
as mosaics of predominantly one or a few types of residue.
Their compositional bias is in marked contrast to the
structural domains and motifs of globular proteins familiar
from crystal and NMR structures. Based on a relatively
stringent definition of low-complexity, more than half
of the sequences in the database contain at least one such
region, and 14% of the amino acids occur in clusters of
highly biased local composition. Moreover, a large excess
of "medium-complexity" regions may be defined using a
less stringent definition of complexity: these are found in
many recently-deduced protein sequences that lack true
homologues and do not belong to the class of "ancient
conserved sequences". Very little is known about the
molecular structures, dynamics, interactions and evolution
of most low- and medium-complexity protein segments.



Fig. 3 The mouse protein Sos1 functions as a key intermediate in transmitting signals from receptor tyrosine kinases to ras via protein-protein interactions.
Sos1 (PIR accession 821391) is a member of a family of ras guanine nucleotide-releasing proteins (GNRP) that also includes S. cerevisiae CDC25 and SDC25, S.
pombe Ste6 and the Drosophila gene, Son of sevenless. Mouse Sos1 is a large, mosaic protein with several different domains, including a rasGNRP domain and
a low complexity region that binds to an "adapter" protein called Grb2. a, Results of a BLASTP search using an Sos1 query sequence without any masking
applied. In addition to several "self hits" in the output, we see significant matches to some S. cerevisiae proteins, but Ste6 does not appear in the top 20 matches
despite its presence in the database (PIR International, release 37). Moreover, the true positive matches are interspersed with many false positives, consisting of a
number of functionally unrelated proline-rich proteins. These artifactual matches are highly significant in the statistical sense, but a glance at some of the local
alignments shows that one is not justified in inferring similar function despite the high scores and low p-values. b, An identical search, except that in this case the
Sos1 query has been pre-processed using SEG masking with default parameters. Note that the top of the "hit list" is now populated only by bona fide members of
the rasGNRP family and that all artifactual matches against proline-rich proteins have disappeared. Furthermore, a match to S. pombe Ste6 is now obvious; a local
alignment between this protein and Sos1 is shown. Interestingly, Sos1 shows significant local similarities to histone H2A and β-spectrin (see below). c, Results of
another search with masking of both low complexity regions (b) and the rasGNRP domain. The top four matches now consist only of those proteins that share
more extensive, or global, similarity with the query beyond the rasGNRP domain. In this example, the additional information gained by this extra masking step is
not striking. But one can imagine the dramatic effect this would have in shrinking the "hit list" if the query possessed a kinase domain, of which there are hundreds
of examples in the database. (See ref. 74 for an example involving immunoglobulin domains.) d, The query sequence, mouse Sos1, annotated with the various
domains identifiable by BLASTP searching. The rasGNRP domain is according to Boguski & McCormick. The proline-rich carboxy terminal region is known to
interact with Src homology (SH3) domains in Grb2. With regard to the local similarities between Sos1 and histone H2A and β-spectrin, it has recently been shown
that Sos1, β-spectrin and a number of other proteins possess "pleckstrin homology" or PH domains. The local alignment produced by BLASTP (c) corresponds
to these PH domains. The similarity between Sos1 and histone H2A has not previously been reported and is difficult to interpret biologically. Nonetheless, the
similarity is as significant as that of the PH domain and may have structural, as opposed to functional, implications.
Low-complexity segments confound database search
algorithms in two ways. First, most of these segments do
not generally give meaningful alignments position by
position in ways that reflect actual structure and mutational
history: they evidently evolve relatively rapidly by processes
such as replication slippage and repeat expansion. (At
the DNA sequence level, trinucleotide and dinucleotide
repeat polymorphisms provide a familiar example.)
Permutations, shuffles or reversals of low-complexity
amino acid sequences generally give alignment scores
similar to the original sequence. Second, the residue
compositions of low-complexity segments are very
different from that of the database as a whole. This is
evident if all low-complexity segments in the database are
grouped into a single class: a strong excess of alanine,
glycine, proline, serine, glutamate and glutamine results.
However, this lumped class is itself heterogeneous,
containing for example glutamine-rich and proline-rich
subclasses. These statistical biases contrast with those that
characterize the bulk of most query and database
sequences, and on which score-based alignment statistics
are founded. Thus the high scores of alignments of low-
complexity segments are due primarily to their
compositional biases and do not necessarily reflect
significant positional similarity.

Several classes of low-complexity residue clusters have
been analysed for statistical significance by Karlin and
coworkers. Their methods, which use the contrasting
residue frequencies of specific clusters and those of
complete proteins or databases, are embodied in the SAPS
software. SEG, the algorithm employed by the BLAST
programs for filtering low-complexity segments from
query sequences prior to database searching (Figs 2 and
3), employs instead optimal segmentation methods applied
to a more general definition of compositional complexity
(see Box 2).
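
The general idea of masking a query by local composition can be illustrated with a much cruder stand-in for SEG. The sketch below is explicitly not the SEG algorithm: it uses the Shannon entropy of residue composition in a sliding window rather than the complexity measure K of Box 2, it performs no optimal segmentation, and the window length, threshold value and function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="x"):
    """Replace residues in every low-entropy window with mask_char.

    Any window whose compositional entropy falls below `threshold` is
    masked in full, so masking can extend a few residues beyond a biased
    region. SEG's segmentation step, which this sketch omits, is what
    trims such boundaries.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

The masked query keeps its original length and coordinates, so alignments found with it can be reported against the unmasked sequence.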

Masking of highly abundant sequences

Database searching can be performed efficiently in phases,
with a query first compared to a small database containing
domains representative of large sequence families.
Subsequences of a query that match one or more of these
domains can then be masked prior to full-scale searching,
thereby eliminating most of the redundant output.
Annotated collections of prototypic human repetitive
sequences such as Alu and protein kinase catalytic
domains exist and can be used to pre-filter a query (Fig
3c). (Both of these data sets are available from the NCBI
Data Repository on CD-ROM and by anonymous ftp. See
/repbase/alu, /repbase/humrep and /pkinases/pkcdd.fa at
ncbi.nlm.nih.gov.) For proteins, a more comprehensive
solution to the problem is approached by building a small,
representative set of protein superfamilies or motifs and
using this as a screening database with automatic masking

1. Altschul, S.F. Amino acid substitution matrices from an information theoretic
perspective. J. molec. Biol. 219, 555-565 (1991).

2. Altschul, S.F. A protein alignment scoring system sensitive at all evolutionary
distances. J. molec. Evol. 36, 290-300 (1993).

3. States, D.J., Gish, W. & Altschul, S.F. Improved sensitivity of nucleic acid
database searches using application-specific scoring matrices. Methods 3,
66-70 (1991).

4. Gish, W. & States, D.J. Identification of protein coding regions by database
similarity search. Nature Genet. 3, 265-272 (1993).

5. Claverie, J.-M. Detecting frameshifts by amino acid sequence comparison.
J. molec. Biol. 234, 1140-1157 (1993).

6. Karlin, S. & Altschul, S.F. Methods for assessing the statistical significance
of molecular sequence features by using general scoring schemes. Proc.
natn. Acad. Sci. U.S.A. 87, 2264-2268 (1990).



of matching query subsequences (unpublished results).
This technology is still under development but recent
studies indicate that a representative set of only 1,000-
3,000 sequences may suffice; such a database can be
searched in seconds. The first large-scale implementation
of this strategy has been performed for a specialized
database of "expressed sequence tags" or ESTs, where
such pre-filtering is also employed to detect contamination
by vector sequences.
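
The two-phase strategy described above can be sketched as a screening step followed by masking of the matched query regions. This is an illustrative outline only: the `screen` callable stands in for a search of the query against a small database of representative domains or repeats, and all names and the interval convention (0-based, half-open) are assumptions.

```python
def mask_intervals(query, hits, mask_char="x"):
    """Mask every (start, end) interval of the query; overlaps are allowed."""
    masked = list(query)
    for start, end in hits:
        for i in range(max(start, 0), min(end, len(query))):
            masked[i] = mask_char
    return "".join(masked)


def prefilter_query(query, screen, mask_char="x"):
    """Phase 1 of a phased search.

    `screen` is any function that takes the query and returns the
    intervals matching the small screening database. The returned masked
    query would then be used for the full-scale search (phase 2).
    """
    hits = screen(query)
    return mask_intervals(query, hits, mask_char)
```

The same masking routine would serve for vector contamination, since a vector match is simply another set of intervals on the query.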

Conclusions 

The stated goals of the U.S. Genome Project include the
production of 50 megabases of DNA sequence data per
year by 1998 and the identification and correlation of
genes in humans and model organisms. Database
similarity searching will be one of the major informatics
tools used in this endeavor. Not only efficient algorithms,
but also a choice of appropriate scoring systems, well-
defined measures of statistical significance and a better
understanding of the sequences themselves, are critical
for the automated analysis schemes that this amount of
data will inevitably require.

Special purpose and faster general purpose computers
will have roles in sifting through this increasing volume of
sequence data. But large improvements in the efficiency
of searching can be obtained by considering the nature of
the data and implementing new strategies that capitalize
on this knowledge. One of these strategies is to preprocess
a query sequence to identify known domains and motifs,
dispersed repeats, low complexity segments and other
regions of compositional bias such as potential membrane-
spanning and α-helical coiled-coil regions. We have
described several preprocessing techniques that are suitable
for automation and have demonstrated their practical
utility with examples. Foreknowledge of query features
enables one to perform faster and more effective searches
and to evaluate search results better.

Another, complementary strategy is to reduce the
redundancy in the target database(s) to be searched.
We have outlined one simple but useful approach to
the reductive merging of diverse, but overlapping,
source databases. But newer, cleaner and richer views
of the sequence data, optimized for gene discovery, are
on the horizon.

Note added in proof: NCBI has recently established a
GenBank® World Wide Web server (the URL is
http://www.ncbi.nlm.nih.gov) that provides network access
to many of the software tools and data sources described
in this review.
For hundreds of millions of years, perhaps from the very
beginning of their evolutionary history, eukaryotic cells have
been habitats and junkyards for countless generations of
transposable elements, preserved in repetitive DNA
sequences. Analysis of these sequences, combined with
experimental research, reveals a history of complex
'intracellular ecosystems' of transposable elements that are
inseparably associated with genomic evolution.
Abbreviations

L1-EN endonucleolytic domain in L1 reverse transcriptase

LINE long interspersed nuclear element

LTR long terminal repeat

MIR mammalian-wide interspersed repeat

SINE short interspersed nuclear element 

TE transposable element 

TSD target site duplication 

Introduction 

Repetitive DNA is a major component of eukaryotic
genomes. Understanding its origin, evolution, and genetic
impact upon the host DNA is therefore of fundamental
importance for genome studies. There are two major
groups of repeats in eukaryotic genomes: tandemly
repeated satellites, usually confined to specific chromosomal
regions; and the repeats interspersed with genomic DNA
that are the major focus of this review. Interspersed
repeats represent mostly inactive copies of a wide variety
of contemporarily and historically active transposable
elements (TEs) such as retroelements and DNA
transposons, which can each be further subdivided into distinct
classes. Repetitive sequences have been recruited as
functional components of eukaryotic genomes, which
documents their contribution to genomic evolution [2-6].
They are also an important source of knowledge about the
biology of active TEs. The emerging picture, bolstered by
recent research, is that TEs are not merely 'parasites'.
Rather, they are integral players in genomic evolution,
showing either a 'selfish' or an 'altruistic' nature,
depending on different evolutionary circumstances.

Reconstruction and analysis of repetitive DNA 

As stated above, interspersed repetitive sequences
represent inactive (pseudogene) copies of historically or
contemporarily active TEs. The study of a new TE usually begins
with the identification of its repeated copies, followed by
sequence alignment, classification into subfamilies (if
applicable) and construction of consensus sequences [7].
Apart from the original TEs themselves, consensus
sequences represent the best available approximations of
the original active TEs that generated the repeats. Figure 1
illustrates the relationship between the similarities of
individual repeats to perfect consensus sequences as compared
to similarities between repeats themselves [7]. According to
Figure 1, repeats 37-52% similar to each other will be
55-70% similar to their perfect consensus sequences.
Without such improvement in similarities, the search for
diverse repeats and other biologically meaningful sequence
comparisons may be counterproductive.

Figure 1 
The similarities between a source gene and its repeats as a function of
the similarities between the repeats. The x variable indicates the
average similarity between repeats sharing a common source gene; y
represents the average similarity of repeats to their source gene that
can be approximated by a consensus sequence. For example, repeats
that are on average 50% similar to each other will be >68% similar to
their ideal consensus sequence. Adapted with permission from [7].
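
The relationship plotted in Figure 1 can be approximated with a short calculation. The model below is an assumption introduced here, not one stated in this text: it supposes four equally likely nucleotides and independent substitutions in each copy (a Jukes-Cantor-style model), under which two copies that each match the source with probability y match each other with probability x = y^2 + (1 - y)^2 / 3. Solving for y reproduces the figures quoted above (37% gives 55%, 50% gives about 68%, 52% gives 70%). The function name is hypothetical.

```python
import math


def consensus_similarity(pairwise_similarity):
    """Estimate the similarity of repeats to their source sequence (y)
    from the average similarity between repeats (x).

    Assumes x = y**2 + (1 - y)**2 / 3, which rearranges to
    y = (1 + 3 * sqrt((4x - 1) / 3)) / 4 for 0.25 <= x <= 1.
    """
    x = pairwise_similarity
    if not 0.25 <= x <= 1.0:
        raise ValueError("pairwise similarity must lie between 0.25 and 1")
    p = math.sqrt((4.0 * x - 1.0) / 3.0)
    return (1.0 + 3.0 * p) / 4.0
```

Under this model a consensus is always at least as similar to each copy as the copies are to one another, which is why comparing against a consensus improves sensitivity for diverged repeats.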



One can reconstruct ancestral TEs even with limited
sequence data, especially if individual copies are not very
diverse. Additional information may be taken into account,
such as the high mutability of CpG dinucleotides or the
presence of open reading frames in which nonsense
mutations can be reversed. This has been dramatically
demonstrated for the Tc1-like DNA transposon from fish, named
Sleeping Beauty, whose transposase was reconstructed from
a dozen inactive copies. Its activity has been demonstrated
not only in the fish from which it originated, but also in
human HeLa cells [8**]. This work, and an earlier study
demonstrating the transfer of a mariner element from
Drosophila to Leishmania [9**], are important steps towards
application of DNA transposons in genomic studies.

Reconstructions of TEs are very labor intensive and
require biological insight but they often remain
unpublished. In order to promote the dissemination of this
information and to credit the individual effort that goes into
producing it, a new electronic publication entitled
Repbase Update was established [10*]. Repbase Update
represents a systematic attempt to integrate consensus
sequence data, nomenclature, biological classification and
other relevant information into a coherent resource
necessary for sequence studies. To date, over 950 different
[...]
compiled from all available eukaryotic sequence data (see
Table 1). Of these, over 800 are interspersed repeats. Most
interspersed repeats from vertebrates and plants (~S()%)
have been assigned to one of the following major
categories: non-long terminal repeat (LTR) retrotransposons or
retroposons also known as SINEs and LINEs, and LTR-
retrotransposons including retroviruses and DNA
transposons. The remaining nonplant, nonvertebrate repeats
come from very diverse species, ranging from protozoans
to octopuses, and are temporarily collected under the
arbitrary name of 'invertebrates'. In this group, the fraction of
interspersed repeats assigned to a particular category is
significantly lower {M)-AWf ), mostly due to insufficient
comparative sequence data necessary for the construction of
reliable consensus sequences. This group of repeats is
expected to hold many 'missing links' in our
understanding of the origin and evolution of TEs.

Human and rodent sequences can be screened against the
most recent version of Repbase Update using public
servers [11,12]. Repeat annotation and masking is
recommended prior to exon identification [13,14] but Repbase

Table 1

The current content of Repbase Update.

Type of repeats                    File name     Number of (sub)families
Human repeats                      humrep.ref     284
Alu subfamilies (primate)          humsub.ref      16
Processed pseudogenes (human)      pseudo.ref      20
Rodent repeats                     rodrep.ref     157
Other mammalian repeats            mamrep.ref      96
Other vertebrate repeats           vrtrep.ref      74
Plant repeats                      plnrep.ref      87
Invertebrate repeats               invrep.ref     222
Simple repeats (microsatellites)   simple.ref     131
Total                                            1087
Unique                                            956



Updated human and rodent collections are also available from public
servers for the automatic annotation of DNA sequences [11,12]. Recently
computed proportions of repeats in the nonredundant human sequence
data are as follows: Alu (12.3%); LINE1 (11.9%); MIR (1.6%); LINE2
(2.1%); LTR retrotransposons and endogenous retroviruses (5.6%); DNA
transposons (1.8%); simple repeats (1.4%); other ~0.35%.



Update is increasingly being used for the direct studies of
repetitive DNA.

The genomic fossil record 

The genomic fossil record of past retropositions can be of
great value not only for studies of TEs themselves, but also
for population and phylogenetic studies of their hosts. For
example, young Alu (SINE) subfamilies have been useful
for human population studies. To date, there are five
known Alu subfamilies (Ya1, Ya5, Yb8, Ya8 and Yb9) actively
proliferating in humans [10,15]. Recent innovative studies
of 57 Ya5 Alu sequences, 13 of which are polymorphic
in the human genomic pool, led to an estimate of human effective
population size using coalescence theory [16]. This is

rK.. - : : t' u... 1..-:-.- i: - 

based on Alu retroposition.

Turning to older short interspersed nuclear element
(SINE) families in mammals, Okada's group [17]
obtained a phylogenetic resolution of the long disputed
relationship among whales, ruminants, hippopotamuses
and pigs. They have shown that two SINE families, called
CHR-1 and CHR-2, are present exclusively in the
genomes of whales, ruminants and hippopotamuses, which
together form a monophyletic group distinct from that of
pigs and camels. This finding contradicts previous
phylogenies and illustrates the powerful use of the genomic
fossil record in complementing the paleontological record
which is particularly difficult to obtain for whales.
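
The logic of using shared SINE insertions as clade markers can be sketched in a few lines. The presence/absence table below is a simplified reading of the statement above (CHR-1 and CHR-2 present only in whales, ruminants and hippopotamuses); the data structure and function name are hypothetical and are not taken from the cited study.

```python
# Presence of SINE families per taxon, simplified from the text above.
SINE_PRESENCE = {
    "whale": {"CHR-1", "CHR-2"},
    "ruminant": {"CHR-1", "CHR-2"},
    "hippopotamus": {"CHR-1", "CHR-2"},
    "pig": set(),
    "camel": set(),
}


def taxa_sharing(families, presence):
    """Return the taxa whose genomes carry every SINE family in `families`."""
    required = set(families)
    return sorted(taxon for taxon, found in presence.items() if required <= found)


if __name__ == "__main__":
    # Taxa carrying both families form the candidate monophyletic group.
    print(taxa_sharing(["CHR-1", "CHR-2"], SINE_PRESENCE))
```

Because a retroposed element is unlikely to be inserted independently at the same locus in separate lineages, taxa that share the insertions are grouped together and taxa that lack them fall outside the group.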

Another whale-related development was the identification
of homology between the basic units of common satellites
and L1 elements, representing the most abundant LINE
elements in mammals [18]. Satellites have long been
viewed as a product of unequal crossing over; however,
there is no evidence that they can originate de novo from
nonfunctional 'junk' DNA. The homology between L1
and these satellites supports this scenario and raises many
interesting questions about satellite and genomic evolution.
Another interesting link between satellites and TEs
is the homology between the centromere-associated protein
(CENP-B) and the pogo family of TEs, although biological
interpretation of this fact remains tentative [19,20].

Retro (trans) position: a continuation of the 
transition from the RNA to the DNA world? 

Very little is known about the origin of TEs but it is
conceivable that the 'TE world' can be traced all the way back
to the beginning of the transition from the hypothetical
RNA-based genome to the DNA-based one. From this
point of view, the entire genomic DNA might have evolved
with close participation of TEs, starting with retroposon-like
elements. Many TEs might have evolved into parasites,
particularly those that can migrate between different hosts, but
some may still retain their original properties as 'genome
builders'. The examples of Drosophila non-LTR retroposons
HeT-A and TART which maintain telomeres in Drosophila
[21,22], combined with the recently reported homology
between telomerases and reverse transcriptases [23,24],
bring us closer to this broad perspective [25].

In this context, it may be worthwhile to revisit recent
research on the extensively studied mammalian L1
(LINE1) elements. The origin of active mammalian L1
elements remains obscure, but they have produced a
succession of numerous subfamilies during the past 100 million
years or so [26], and they continue to be active at least
in humans and rodents [27,28]. In spite of their assumed
'selfishness', L1 elements seem to exhibit some remnants
of 'altruistic' features that are compatible with active
participation in genome evolution. They are responsible for
adding over 24% of the DNA to the human genome, only
about half of which is L1 DNA (see legend of Table 1 and
[12]). Unlike other LINE elements that are parasitized by
SINEs homologous to their 3' ends [29], L1s apparently
retropose a large variety of SINE elements and mRNAs
(see below) that have no obvious structural relationship
to their own RNA, with the possible exception of
poly(A) tails [31]. This is consistent with a recent study
demonstrating the ability of L1 reverse transcriptase to
efficiently generate cDNA from RNA with no sequence
specificity and including transcripts from cellular genes [32].
Even the affinity of L1 reverse transcriptase for polyadenylated
RNA hanging around the ribosomal system [31] may
be interpreted as a remnant of the original participation of
L1 predecessors in the retroposition of protein encoding
RNA. Another relevant property may be the ability of L1
reverse transcriptase to heal chromosomal breaks, although
there is some debate as to whether this cannot be attributed
to nonhomologous recombination events [33,34].

Diversity and co-evolution of TEs 

The genomic fossil record deposited in eukaryotic
genomes shows that autonomous TEs tend to be accompanied
by nonautonomous companions that are unable to
proliferate themselves. Examples include transposon deletion
fragments [35,36], SINE elements homologous to 3'
ends of LINE elements [29], and defective LTR
retrotransposons, including defective endogenous retroviruses.
To multiply, the first group must be able to use transposase
from intact DNA transposons, SINE proliferation depends
on LINE-encoded reverse transcriptase and the remaining
retroelements probably rely on intact viruses for their
reproduction. There may be a delicate balance between
the autonomous and nonautonomous groups of TEs,
analogous to the balance between species in complex
ecosystems. Autonomous elements proliferating out of control
may destroy their hosts. Nonautonomous elements may
destroy themselves by 'successful' competition for the
reverse transcriptase or transposase produced by the
autonomous TEs. Transposase titration by defective
transposons has been discussed among possible factors for the
restriction of the activity of mariner-like transposable
elements in natural populations [36], although more
specialized mechanisms, such as overproduction inhibition, and
missense mutation effects are viewed as more prominent
events in limiting proliferation of DNA transposons.
Multiple LINE1 and SINE (Alu, B1, B2, BC1, etc.)
subfamilies in mammals may be viewed as examples of the
ongoing co-evolution that is driven by competition for
reverse transcriptase [26,30,37]. LINE2 and mammalian-
wide interspersed repeat (MIR) elements [12] might have
become extinct as a result of similar competition. Among
general mechanisms for the restriction of TEs on the
genomic side, suppression by CpG methylation and
heterochromatinization have recently been discussed [4,38,39].
Overall, our knowledge of the mechanisms controlling
TEs at the genomic level is still fragmentary [40].

Co-evolution between autonomous and nonautonomous
elements may not be sufficient to account for the diversity
of endogenous retroviruses and retroviral-like elements in
mammals. Almost half of all the human repetitive elements
deposited in Repbase Update [10] are either diverse LTRs
or fragments of viruses and LTR retrotransposons, although
they represent less than 6% of the human genome (see
legend of Table 1). In this context, it is worth mentioning a
renewed interest in co-evolution between endogenous and
exogenous retroviruses that could benefit the host [41,42].
Other related possibilities include recurrent infections and
recombinations between distantly related viruses (VV
Kapitonov and J Jurka, unpublished data).

Targeting the mammalian genome 

Sequence analysis of target site duplications (TSDs) of
retroposed elements from mammals [30], combined with the
independent discovery of the endonucleolytic domain in L1
reverse transcriptase (L1-EN, reviewed in [31]), brought
about a recent breakthrough in our understanding of
retroposon integration in mammals. The consensus sequence of
TSDs and adjacent regions for L1, Alu, ID (BC1), B1, B2,
and processed pseudogenes is ITI.W•V^^N),^_sT^"l NIR,
where R denotes purines, Y represents pyrimidines and N is
any base. The vertical bars show predicted positions of
breakpoints on the opposite strands of double-stranded
DNA [30,37]. TT|AAAA resembles the consensus sequence
nicked by the L1-EN [43], an additional argument
implicating L1 reverse transcriptase in the retroposition of
nonautonomous retroposons. The general consensus sequence of
the TSDs may combine different subclasses of targets. For
example, targets beginning with 11 IA(fA.\ are longer on
average than the targets beginning with 'ITLV-VW (J Jurka,
unpublished data). Different target preferences may be
related to different active L1s [27].

The conserved sequences around both breakpoints in the
consensus sequence given above appear to be different from
each other, but separate analyses indicate that both
sequences are enriched with kinkable TA, CA and TG
dinucleotide steps, which suggests a similar mechanism by
which both breaks are generated [44]. This mechanism may
be of general significance since the kinkable dinucleotides
are conserved in targets both for DNA transposons and for
insertion elements in bacteria [44].

In analogy to the model of integration of insect R2
non-LTR retroposon [45], the reverse transcription of
mammalian retroposons may be primed by the 3' DNA ends
exposed by nicking. Although self-priming of retroposable
RNA has been recently demonstrated in vitro [46], its role
in the retroposition of mammalian retroposons may be
marginal if any.
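
The enrichment in kinkable dinucleotide steps mentioned above can be measured on any target sequence with a simple count. The following is a minimal sketch, not the method used in the cited analyses: the function name is hypothetical, and it simply reports the fraction of overlapping dinucleotide steps that are TA, CA or TG.

```python
# Pyrimidine-purine steps described in the text as "kinkable".
KINKABLE_STEPS = ("TA", "CA", "TG")


def kinkable_fraction(sequence):
    """Fraction of overlapping dinucleotide steps that are TA, CA or TG."""
    seq = sequence.upper()
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not steps:
        return 0.0  # sequences shorter than 2 bases have no steps
    return sum(step in KINKABLE_STEPS for step in steps) / len(steps)


if __name__ == "__main__":
    # TTAAAA has five steps (TT, TA, AA, AA, AA); one of them is kinkable.
    print(kinkable_fraction("TTAAAA"))
```

A real enrichment test would compare this fraction with the value expected from the base composition of the surrounding sequence; that comparison is left out of this sketch.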

It has long been known that double-stranded breaks stimulate
homologous recombination. Therefore, DNA targets
exposed to L1-EN nicking activity may be recombinational
hot spots in mammalian genomes. This may have implications
for the understanding of at least some of the fragile
chromosomal sites involved in the origin of genetic diseases.

Conclusions 

The reverse flow of information from RNA to DNA might
have had a definite beginning in the history of life, but it has
never ended. It remains an integral part of the ongoing
genomic evolution in eukaryotic species. It is manifested in
active retroposons and in their fossil record as interspersed
repetitive DNA. These are the major conclusions emerging
from recent progress in the field. Based on these conclusions,
the one-dimensional interpretation of TEs as 'parasites' or
'selfish' elements should be transformed into a more balanced
view, with their diverse roles comparable to the biological
roles of individual species in evolving ecosystems. As
the diverse world of TEs continues to emerge with new
sequence data, TEs are increasingly being explored in a
broad range of biological problems, from phylogenetic and
population studies to genome engineering.

More than 100 genes causing inherited retinal diseases
have been mapped to chromosomal locations,
but less than half of these genes have been cloned.
Mutations in many retina/pineal-specific genes are
known to cause inherited retinal diseases. Examples
include mutations in arrestin, rhodopsin kinase, and
the cone-rod homeobox gene, CRX. To identify additional
candidate genes for inherited retinal disorders,
novel retina/pineal-expressed EST clusters were iden- 
tified from the TIGR Human Gene Index database and 
mapped to specific chromosomal sites. After known 
human gene sequences were excluded, and repeat se- 
quences were masked, 26 novel retina and pineal 
gland cDNA clusters were identified. The retinal ex- 
pression of each novel EST cluster was confirmed by 
PCR assay of a retinal cDNA library, and each cluster
was localized in the genome using the GeneBridge 4.0 
radiation hybrid panel. In silico expression data from 
the TIGR database suggest that these EST clusters are 
retina/pineal-specific or predominantly expressed in 
these tissues. This combination of database analysis 
and laboratory investigation has localized several EST 
clusters that are potential candidates for genes causing
inherited retinopathy. © 1999 Academic Press



INTRODUCTION 

Although more than 100 genes causing inherited 
retinal diseases have been mapped to chromosomal 
locations, less than half of these genes have been 
cloned (RetNet, http://www.sph.uth.tmc.edu/RetNet). 
Many of the mutations leading to inherited retinal 
disorders have been identified in genes that are ex- 
pressed predominantly in the retina and pineal gland. 
Photoreceptors and pinealocytes are developmentally 
related and also share expression of many genes involved
in phototransduction. Therefore, novel genes
with expression patterns limited to these two tissues
are potential candidates for inherited retinal disorders.

Sequence data from this article have been deposited with the
EMBL/GenBank Data Libraries under Accession Nos. G42173-
G42198.

The retina and pineal gland both originate embryologically
from the most anterior region of the neural
plate, the diencephalon (Gilbert, 1994). Development
and differentiation of these organs are also related, as
many of the same developmental genes, such as the
homeobox genes Xrx1 (Casarosa et al., 1997) and Crx
(Chen et al., 1997), have expression patterns limited to
the developing retina and pineal gland. Furthermore,
mammalian pinealocytes are evolutionarily related to
photoreceptor cells (Vollrath, 1985) and express a selective
group of "retinal proteins" that are involved
in the phototransduction cascade, such as rhodopsin
kinase, phosphodiesterase, and transducin (Lolley et
al., 1992). Neonatal pinealocytes express both "rod-
specific" and "cone-specific" phototransduction components,
and different subtypes of pinealocytes may express
varying combinations of these phototransduction
enzymes, similar to the different subtypes of photoreceptors
in the retina (Blackshaw and Snyder, 1997).
Inherited retinal diseases have been associated with
mutations in retina and pineal gland transcription factor
genes, such as the cone-rod homeobox gene CRX
(Freund et al., 1997, 1998; Sohocki et al., 1998; Swain
et al., 1998; Swain et al., 1997), as well as in genes
involved in the phototransduction cascade, such as arrestin
(Fuchs et al., 1995; Nakazawa et al., 1998; Wada
et al., 1996) and rhodopsin kinase (Khani et al., 1998;
Yamamoto et al., 1997).

The goal of this study was to identify novel retina/ 
pineal-specific EST clusters as potential candidate 
genes for inherited retinal disorders using a combina- 
tion of database analysis and laboratory investigation. 
Expressed sequence tags (ESTs) are partial cDNA se- 
quences that are being identified from tissue-specific 
cDNA libraries by large human genome centers and 
are deposited into databases, such as GenBank dbEST. 
The TIGR Human Gene Index database
(http://www.tigr.org/tdb/hgi/hgi.html) lists assembled clusters of
ESTs, which usually arise from the same transcript,
and organizes these clusters according to tissue expression.
We identified EST clusters expressed in the retina
and pineal gland from the TIGR database, eliminated
clusters that are expressed in additional tissues
or represent known genes, and eliminated clusters
composed of repeat sequences only. PCR primers were
designed to the remaining 26 clusters and used to
confirm retinal expression as well as to localize the
gene encoding each EST cluster within the genome. At
least 7 of the retina and pineal gland expressed genes
identified in this study localize within the minimal
candidate region of mapped inherited retinal diseases.

MATERIALS AND METHODS 

Identification of retina and pineal gland clusters. The TIGR Human
Gene Index database release version 3.3 was searched in
July 1998 for EST clusters with at least 10% pineal transcripts.
Only clusters including retina and pineal gland ESTs, or including
retina, pineal gland, and brain or cancer tumor ESTs, were studied
further. Any repeat sequences within a cluster were masked
by the RepeatMasker program
(http://ftp.genome.washington.edu/RM/RepeatMasker.html) before
BLAST homology searches were performed (Altschul et al., 1990)
using the NCBI server (http://www.ncbi.nlm.nih.gov/BLAST/).
Clusters identified by BLAST as representing known genes were
excluded from the study, as were clusters identified by BLAST
that include ESTs from tissues other than retina, pineal gland,
brain, or cancer tissues.
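
The tissue-based selection rule described above can be written as a short filter. This is a hedged sketch only: the `Cluster` class, the tissue labels, and the function names are hypothetical, the TIGR database is not queried, and the RepeatMasker and BLAST steps are not modeled.

```python
from dataclasses import dataclass
from typing import List

# Tissues tolerated in addition to retina and pineal gland.
ALLOWED_EXTRA_TISSUES = {"brain", "cancer"}


@dataclass
class Cluster:
    name: str                # hypothetical cluster identifier
    est_tissues: List[str]   # tissue of origin for each EST in the cluster


def pineal_fraction(cluster):
    """Fraction of the cluster's ESTs that come from pineal gland."""
    if not cluster.est_tissues:
        return 0.0
    return cluster.est_tissues.count("pineal") / len(cluster.est_tissues)


def passes_tissue_filter(cluster, min_pineal=0.10):
    """Apply the selection rule: at least 10% pineal ESTs, at least one
    retina EST, and no ESTs from tissues other than brain or cancer."""
    tissues = set(cluster.est_tissues)
    allowed = {"retina", "pineal"} | ALLOWED_EXTRA_TISSUES
    return (
        pineal_fraction(cluster) >= min_pineal
        and "retina" in tissues
        and tissues <= allowed
    )


if __name__ == "__main__":
    examples = [
        Cluster("example_A", ["pineal", "retina"]),
        Cluster("example_B", ["pineal", "retina", "liver"]),
        Cluster("example_C", ["pineal", "retina"] + ["brain"] * 10),
    ]
    for c in examples:
        print(c.name, passes_tissue_filter(c))
```

In this sketch, `example_B` is rejected because of the liver EST and `example_C` because its pineal fraction (1 of 12) falls below the 10% threshold.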

Localization of clusters and confirmation of retinal expression.
Localization involved optimization of PCR primers, PCR product
analysis to confirm identity, and radiation hybrid mapping.
PCR primers were designed for an STS of each cluster using the
Primer3 program
(http://www-genome.wi.mit.edu/cgi-bin/primer/primer3.cgi).
Primer pairs were optimized for PCR of human
genomic DNA using a standard protocol of 35 cycles with AmpliTaq
Gold polymerase (Perkin-Elmer) and an annealing temperature gradient
generated in a Stratagene Robocycler thermocycler. The resulting
DNA fragments were separated on standard 2% agarose gels. If
the resulting fragment was not of the expected size (indicating either
an intervening intron or the wrong product), the fragment was
treated with shrimp alkaline phosphatase and exonuclease (Amersham),
followed by manual sequencing with the AmpliCycle Sequencing
Kit (Perkin-Elmer) and a primer end-labeled with ^P on a
6% Long Ranger (FMC Bioproducts) denaturing acrylamide gel.
Each cluster was localized in the genome by PCR assay with the
same primers (using optimized conditions) in the GeneBridge 4.0
radiation hybrid panel (Research Genetics). Results were submitted
to the GeneBridge 4.0 mapping server at the Whitehead Institute
(http://carbon.wi.mit.edu:8000/cgi-bin/contig/rhmapper.pl) using a
minimum lod score of 15 for placement. The resulting mapping
information was then compared to the Stanford
(http://www-shgc.stanford.edu/Mapping/index.html) and Whitehead
Institute (http://carbon.wi.mit.edu:8000/cgi-bin/contig/phys_map)
radiation hybrid maps for identification of the chromosomal band
containing the gene encoding the cDNA cluster. Each cluster was
then assayed for retinal expression by PCR in a human retina cDNA
library kindly provided by Dr. Jeremy Nathans (Nathans and Hogness,
1984), followed by separation of products on a 2% agarose gel. The
sequence, PCR primers, and amplification conditions for each STS
developed in this study are available in GenBank and dbSTS (NCBI)
(Table 2).

RESULTS 

Identification of Retina/Pineal cDNA Clusters

Retina and pineal gland cDNA clusters were selected
for mapping by the following strategy. First, all clusters
with 10% or more pineal gland transcripts were
identified in the TIGR Human Gene Index. After duplicate
clusters and clusters from the 5' and 3' ends of
the same clone were eliminated, 1047 clusters or ESTs
remained. The remaining clusters were then scanned
for those with (i) one or more retina-expressed ESTs
but (ii) no ESTs from other tissues, except brain or
cancer cells. Forty-five clusters containing retina and
pineal gland ESTs remained. Twelve of these were
excluded because they were found to represent known
genes (Table 1). The remaining 33 clusters were then
tested by BLAST analysis for highly similar sequences
in the GenBank dbEST database. Four clusters
(THC233355, THC230448, THC201975, and THC229881)
were excluded from further study because EST sequences
from other tissues were identified in dbEST.
In addition, THC224189 was excluded because PCR
primers for its assay could not be designed, as the
majority of its sequence is Alu repeat sequences and
there was no opposite end information for any of the
cDNAs in the cluster. Two clusters, THC133954 and
THC198187, were highly similar by BLAST analysis
to genomic clones with known localizations: THC133954
overlaps with 12PTEL055, which maps to 12p13.3, and
THC198187 overlaps with 425C14, which maps to
6q22. The remaining 26 retina/pineal-specific clusters
were judged to represent novel genes with unknown
localizations.

TABLE 1

Retina/Pineal-Specific THC Clusters
Representing Known Genes

THC name   Gene name, MIM (a) No.
78331      Interphotoreceptor retinoid-binding protein, IRBP, 180290
86178      Guanine nucleotide-binding protein, β polypeptide 3, 139130
100760     cGMP phosphodiesterase, β polypeptide, 180072
164291     Torsin B, DYT1, 128100
166839     Paired box homeotic protein 6, PAX6, 106210
172410     Synaptophysin p38, 313475
175189     Recoverin, 179618
175673     Transducin, γ-subunit, 189970
177643     Retinoschisis protein, XLRS1, 312700
213359     Chimaerin, β2 glial fibrillary acidic protein, 602857
215703     Guanylyl cyclase activating protein, GCAP, 600364
216888     Voltage-gated potassium channel, KCNB1, 600397

(a) Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/Omim/).
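
The cluster counts reported in this section can be checked with simple bookkeeping. The sketch below is illustrative only; the step labels paraphrase the text and the function name is hypothetical.

```python
# Exclusion steps applied to the 45 retina/pineal clusters, as reported
# in the text: (reason, number of clusters removed).
EXCLUSION_STEPS = [
    ("represent known genes (Table 1)", 12),
    ("ESTs from other tissues found in dbEST", 4),
    ("PCR primers could not be designed (THC224189)", 1),
    ("match genomic clones with known localization", 2),
]


def remaining_after_each_step(start, steps):
    """Return the running cluster count after each exclusion step."""
    trail = [start]
    for _reason, removed in steps:
        trail.append(trail[-1] - removed)
    return trail


if __name__ == "__main__":
    print(remaining_after_each_step(45, EXCLUSION_STEPS))
```

The running totals are 45, 33, 29, 28 and 26, which agrees with the 33 clusters carried into the dbEST comparison and the 26 novel clusters finally mapped.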

Localization of Clusters 

The primer pairs for the STS for each of the 26
clusters were optimized in genomic DNA prior to PCR
assay in the radiation hybrid panel. On optimization,
the STS fragment for THC158983 was much larger
than expected; however, sequencing revealed that the
fragment included an intron flanked by coding sequence
that matched the predicted coding sequence for
this cDNA cluster. Table 2 presents the STS name,
number of cDNAs of each type within the cluster, and
chromosomal mapping location for each novel cluster
mapped in this study.

TABLE 2

Retina/Pineal-Specific Clusters Mapped in This Study

Laboratory  THC cluster  GenBank        No. of       No. of        Number of        Mapping   Candidate for inherited
ID          name         Accession No.  pineal ESTs  retinal ESTs  other ESTs       location  retinal disorder (a)
MMS01       90422        G42173         3            2             0                17p13     RP13
MMS02       90997        G42196         1            1             0                12q24.1
MMS03       133968       G42174         1            1             1 infant brain   9p21
MMS04       137161       G42175                                    0                12p13.31
MMS05       137267       G42197         1            1             0                Xq21-q22
MMS06       153932       G42176         1            1             1 cancer         3q29      OPA1 (Kjer type)
MMS07       154909       G42177         1                          0                11q25
MMS08       157357       G42178         1            1             2 brain, 2 cancer  9q22.3
MMS09       158470       G42179                      16            0                11q13.3   EVR
MMS10       158983       G42195         1            1             0                9q22.3
MMS11       160180       G42180         1            1             1 cancer         6p23
MMS12       160504       n.AO^Q^                     4                              IpGG
MMS13       160521       G42182         1            1             0                1p22.1
MMS14       163082       G42183         1            2             0                5q14      WGN1/ERVR
MMS15       174321       G42184         3            4             2 brain          10q22.3
MMS16       177310       G42185         2            2             3 brain          19p13.3
MMS17       177379       G42186         3            4             0                19q13
MMS18       180397       G42187         5            2             25 brain         2q37
MMS19       195887       G42188         1            1             0                12q13
MMS20       195934       G42189         1            1             0                1q31.1    RP12
MMS21       202304       G42190         6            4             23 brain         19q13.4
MMS22       207703       G42191         1            2             2 brain          8p22
MMS23       210727       G42192         1            1             0                10q26.1
MMS24       220430       G42198         1            2             0                17p13     RP13
MMS25       229889       G42193         2            2             1 brain          15q24.1   RP, MR
MMS26       229891       G42195         1            2             0                5q31

(a) Candidates were mapped to published candidate region for these loci. RP13, retinitis pigmentosa 13 locus; OPA1, optic atrophy 1 locus;
EVR, exudative vitreoretinopathy; WGN1, Wagner syndrome; ERVR, erosive vitreoretinopathy; RP, MR refers to recently reported retinitis
pigmentosa with mental retardation locus (Mitchell et al., 1998).

Confirmation of Retinal Expression 

As confirmation of retinal expression, the STS sequences
for each cluster were assayed by PCR in an
adult retina cDNA library. Some of the sequences,
such as MMS10, MMS13, and MMS26, produced
only a weak amplification, and one sequence,
MMS05, failed to amplify from the library.

DISCUSSION 

Many of the known mutations leading to nonsyndromic
inherited retinal degeneration are located in
genes with either retina-specific or retina/pineal-specific
expression patterns. Moreover, these mutations
are usually found in genes whose expression in the
retina is limited to the photoreceptors, such as rhodopsin,
peripherin, or CRX (Freund et al., 1997). However,
identification of photoreceptor-expressed genes as candidates
for inherited retinal disorders has required
tedious laboratory experiments such as in situ hybridization.
Because the pinealocytes and photoreceptors
are developmentally and functionally related, this
study focused on genes with expression limited to the
pineal gland and retina, with the expectation that
many of these will be expressed in photoreceptors.
cDNA clusters that also included brain cDNAs were
not excluded from study, as brain transcripts may be of
pineal origin. In addition, cDNA clusters that also included
cancer tissue transcripts were not excluded,
because tumor cells may express transcripts not expressed
in the nontransformed tissue.

Retinal expression of each novel retina and pineal 
gland cDNA cluster in this study was confirmed, with 
the exception of THC137267. The STS for this cluster 
failed to amplify from the retinal cDNA library, but 
amplified from the genomic DNA of the radiation hy- 
brid panel. TIGR lists a single adult retinal cDNA for 
this cluster, and it is likely that the cDNA was not from 
a gene normally transcribed in the retina. 

The cDNA clusters that were identified by this
method as representing transcripts of known genes are
described in Table 1. These findings prove the validity
of this approach for identifying candidate genes for
inherited retinal disorders, because mutations of some
of these genes, such as the retinoschisis protein and
the paired homeobox gene 6 (PAX6), are associated
with inherited retinal diseases.

The 26 novel retina and pineal gland expressed
"genes" that were identified and mapped in this study
are shown in Table 2. The term genes is used loosely, as
it is possible that STSs that map close to one another
may be from the same gene, for example, MMS01 and
MMS24 on chromosome 17 or MMS08 and MMS10 on
chromosome 9. It is also possible that transcripts from
tissues other than the retina and pineal gland may be
identified later for some genes mapped in this study.
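
One way to flag STSs that might derive from the same gene is to group them by reported chromosomal band. The sketch below uses a handful of rows from Table 2; the function name is hypothetical, and sharing a band is only a coarse hint, since a band spans far more than one gene.

```python
from collections import defaultdict

# A few (STS, mapping location) pairs taken from Table 2.
STS_LOCATIONS = {
    "MMS01": "17p13",
    "MMS24": "17p13",
    "MMS08": "9q22.3",
    "MMS10": "9q22.3",
    "MMS13": "1p22.1",
    "MMS20": "1q31.1",
}


def colocalized_sts(locations):
    """Group STS names by band and keep bands shared by two or more STSs."""
    by_band = defaultdict(list)
    for sts, band in locations.items():
        by_band[band].append(sts)
    return {band: sorted(names) for band, names in by_band.items() if len(names) > 1}


if __name__ == "__main__":
    for band, names in sorted(colocalized_sts(STS_LOCATIONS).items()):
        print(band, names)
```

With these rows the function reports MMS01/MMS24 at 17p13 and MMS08/MMS10 at 9q22.3, the two pairs singled out in the text.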

Seven of the STSs localized in this study fall within
the published candidate regions for mapped inherited
retinal diseases as shown in Table 2. The phenotypes of
these autosomal loci include dominant retinitis pigmentosa
(RP13, MIM No. 600059), recessive retinitis
pigmentosa (RP12, MIM No. 600105), dominant optic
atrophy (OPA1, MIM No. 165500), dominant familial
exudative vitreoretinopathy (EVR, MIM No. 133780),
and dominant Wagner syndrome or erosive vitreoretinopathy
(WGN1/ERVR, MIM No. 143200). In addition,
one of these seven genes mapped within the candidate
region for recessive mental retardation and retinitis
pigmentosa, recently assigned to 15q24 (Mitchell et al.,
1998). Further laboratory investigation, including full-
length cDNA sequencing and genomic characterization,
followed by analysis of DNA samples from affected
family members, will be necessary to determine
whether mutations in any of these candidate genes
cause inherited retinal diseases.

Subsequent to the completion of this study, the latest
GeneMap of the Human Genome was released
(http://www.ncbi.nlm.nih.gov/genemap/). STSs reported in
GeneMap'98 confirm mapping of eight of the genes
mapped in this study. However, GeneMap'98 reports
only brain or pineal transcripts for four of these:
THC180397 (Unigene Hs.4822, brain), THC153932
(WI-18114, pineal gland), THC202304 (Unigene
Hs.6535, pineal gland and brain), and THC137267
(SGC35226, pineal gland). The gene for one of these
clusters, THC153932, maps within the candidate region
for dominant optic atrophy (OPA1); however, because
GeneMap does not include evidence of retinal
expression for this gene, it might not be considered a
candidate for a retinal disease based on GeneMap
alone. The four remaining genes with mapping confirmed
by GeneMap'98 include retina or retina and
brain ESTs, but are not located within candidate regions;
they are THC207703 (Unigene Hs.12513),
THC157357 (WI-20494), THC137161 (Unigene
Hs.64616), and THC195887 (stSG40815).

In conclusion, we report the identification and localization
of 26 novel retina/pineal gland-expressed genes
by a combination of database analysis and laboratory
techniques. The expression pattern of these genes suggests
the possibility of expression in photoreceptors,
which is the expression pattern of several genes known
to cause inherited retinal disorders. Further, 7 of these
genes are candidates for the cause of known inherited
retinal diseases. The combined approach of database
analysis and laboratory investigation, incorporating
recognition of the biological relationship between photoreceptors
and pinealocytes, is an effective technique
for identification of candidate genes for inherited retinal
disorders.

DETAILED ACTION
Election/Restrictions 

1. Applicant's election of Group 2 in the reply filed on 16 June 2004 is acknowledged. 
Because applicant did not distinctly and specifically point out the supposed errors in the 
restriction requirement, the election has been treated as an election without traverse (MPEP 
§ 818.03(a)). 

2. Claim 39 is withdrawn from further consideration pursuant to 37 CFR 1.142(b) as being 
drawn to a nonelected invention, there being no allowable generic or linking claim. Election was 
made without traverse in the reply filed on 16 June 2004. In the restriction requirement mailed 
claim 39 was omitted from nonelected Group 5, drawn to databases. Claim 39 is withdrawn in 
view of the election of Group 2. 

3. It is noted that the response filed 16 June 2004 contains a marked up copy of the claims 
as required by 37 CFR 1.121 and in addition contains an unnecessary unmarked copy of the 
claims that will not be considered to be the official copy of the claims. 

Priority 

4. Applicant has not complied with one or more conditions for receiving the benefit of an 
earlier filing date under 35 U.S.C. 119(e) as follows: 

An application in which the benefits of an earlier application are desired must contain a 
specific reference to the prior application(s) in the first sentence of the specification or in an 
application data sheet (37 CFR 1.78(a)(2) and (a)(5)). The specific reference to any prior 
nonprovisional application must include the relationship (i.e., continuation, divisional, or 
continuation-in-part) between the applications except when the reference is to a prior application 
of a CPA assigned the same application number. 

It is apparent from the rule 63 Declaration filed on 11 December 2001 that the applicants 
intended to claim the benefit of U.S. Provisional Application No. 60/227099. However until the 
specification is amended to refer to the above application no claim for benefit will be recognized. 

Specification 

5. The sequence listing and computer readable form filed 17 March 2003 have been entered 
into the application history.

6. This application contains sequence disclosures that are encompassed by the definitions
for nucleotide and/or amino acid sequences set forth in 37 CFR §§ 1.821(a)(1) and (a)(2).
However, this application fails to comply with the requirements of 37 CFR §§ 1.821-1.825 for
the following reasons:

Several nucleotide sequences appear in the specification in figure 3 that are not properly 
identified. Nucleotide sequences must be identified by sequence identification number. 
Furthermore, if said sequences do not appear in the sequence listing, a new listing including said
sequences must be supplied. It is often convenient to identify sequences in figures by amending 
the Brief Description of the Drawings section (see MPEP 2422.02). If said sequences consist of a 
portion of sequences already of record in the sequence listing, they may be identified in the 
specification using the existing SEQ ID No. accompanied by the position of the sequence on the 
already listed sequence.

Applicants are required to comply with all the requirements of 37 CFR §§ 1.821-1.825. 
Any response to this Office Action which fails to meet all of these requirements will be
considered non-responsive. The nature of the sequences disclosed in the instant application has
allowed an examination on the merits, the results of which are communicated below. 

7. The specification is objected to as failing to provide proper antecedent basis for the 
claimed subject matter. See 37 CFR 1.75(d)(1) and MPEP § 608.01(o). Correction of the
following is required: The subject matter of claims 10-15 and 17 does not have antecedent basis in
the specification. 

Claim Rejections - 35 USC § 112 

8. The following is a quotation of the first paragraph of 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making 
and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it
pertains, or with which it is most nearly connected, to make and use the same and shall set forth the best mode 
contemplated by the inventor of carrying out his invention. 

9. Claim 17 is rejected under 35 U.S.C. 112, first paragraph, as failing to comply with the
written description requirement. The claim(s) contains subject matter which was not described
in the specification in such a way as to reasonably convey to one skilled in the relevant art that
the inventor(s), at the time the application was filed, had possession of the claimed invention.

Claim 17 is drawn to methods that use a database encoded in a biological medium. The 
specification does not describe databases encoded in a biological medium. 

10. The following is a quotation of the second paragraph of 35 U.S.C. 112:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the 
subject matter which the applicant regards as his invention. 

11. Claims 5-16 are rejected under 35 U.S.C. 112, second paragraph, as being indefinite for
failing to particularly point out and distinctly claim the subject matter which applicant regards as
the invention.

Claims 5-16 are indefinite for recitation of the phrase "said sequences" because it is not
clear which of the sequences in the claims from which claims 5-16 depend the phrase refers to.

Claim Rejections - 35 USC § 103

12. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

13. The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459
(1966), that are applied for establishing a background for determining obviousness under 35
U.S.C. 103(a) are summarized as follows: 

1. Determining the scope and contents of the prior art.

2. Ascertaining the differences between the prior art and the claims at issue. 

3. Resolving the level of ordinary skill in the pertinent art.

4. Considering objective evidence present in the application indicating obviousness 
or nonobviousness. 

14. Claims 2, 3, 5, 7, 8, 18-20, 27, and 30 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Jurka et al. (1996). 

The claims are drawn to a method of making a repeat sequence database by masking 
repeat sequences in a query sequence wherein the repeat sequences are in a repeat sequence 
database, and determining if any remaining unmatched sequences in the query sequence are 
repeat sequences in a repeat sequence database, and if such repeat sequences are determined in 
the query sequence, the query repeat sequences so determined are added to a repeat sequence 
database. In some embodiments the right and left endpoints of the match are determined, the
sequences are DNA sequences, the sequences are human sequences, the repeat sequence
databases are internet accessible and on computer-readable media, and the matching of
sequences is performed by a database search algorithm. In some embodiments the search
algorithm is a Smith Waterman algorithm. 
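
For illustration only, the sequence of steps recited above (mask known repeats, inspect the unmasked remainder for further repeats, add any found to the database) can be sketched as follows. This is a minimal sketch and not the applicant's or CENSOR's implementation. It assumes exact substring matching in place of a Smith Waterman search, the mask character "N", and a hypothetical criterion that a k-mer occurring at least twice in the unmasked remainder counts as a new repeat. All function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

MASK_CHAR = "N"  # assumed mask symbol; the query is assumed not to contain it


def find_matches(query: str, repeat: str) -> List[Tuple[int, int]]:
    """Return (left, right) endpoints of each exact occurrence (right exclusive)."""
    spans = []
    start = query.find(repeat)
    while start != -1:
        spans.append((start, start + len(repeat)))
        start = query.find(repeat, start + 1)
    return spans


def mask_repeats(query: str, repeat_db: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Mask every region of the query that matches a database repeat."""
    masked = list(query)
    spans = []
    for repeat in repeat_db:
        for left, right in find_matches(query, repeat):
            spans.append((left, right))
            masked[left:right] = MASK_CHAR * (right - left)
    return "".join(masked), sorted(spans)


def find_new_repeats(masked: str, k: int, min_count: int = 2) -> List[str]:
    """Hypothetical criterion: k-mers seen min_count times in unmasked segments."""
    counts = Counter()
    for segment in masked.split(MASK_CHAR):
        for i in range(len(segment) - k + 1):
            counts[segment[i:i + k]] += 1
    return sorted(kmer for kmer, n in counts.items() if n >= min_count)


def update_repeat_db(query: str, repeat_db: List[str], k: int):
    """Mask known repeats, then add newly determined repeats to the database."""
    masked, spans = mask_repeats(query, repeat_db)
    new = [r for r in find_new_repeats(masked, k) if r not in repeat_db]
    return masked, spans, repeat_db + new
```

In this sketch a call such as update_repeat_db(query, ["TTGGCCAA"], 7) returns the masked query, the endpoints of the masked regions, and the enlarged database. A real system would replace the exact matching with an alignment-based search and a biologically motivated repeat criterion.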

Jurka et al. (1996) shows in the program description on pages 119-121 a database
matching program called CENSOR. CENSOR determines whether a query sequence contains
repeats that match sequences in a repeat sequence database. CENSOR censors those repeat
sequences so that the remaining query sequence may be matched against the database of choice
without giving undesirable matches to repeat sequences that have been censored. Jurka et al.
(1996) shows on page 119 that in the art the terms censor and masking are equivalent. Jurka et al.
shows matching of query sequences that are DNA and determination of the right and left
endpoints of the match and masked regions in figure 1. Jurka et al. (1996) shows human
repetitive databases in the introduction on page 119. Jurka et al. (1996) shows computer-based
repeat sequence databases throughout, and use of LOCAL, a Smith Waterman database search 
algorithm throughout. Jurka et al. shows on page 121 that one use of CENSOR is to allow for 
masking of repeated sequence followed by a second matching to a repeat sequence database 
using different parameters for possible identification and censoring of more distant repeats. Jurka 
et al. (1996) does not show addition of repeats identified by comparison of a masked query 
sequence to a repeat sequence database. 

It would have been obvious to a person of ordinary skill in the art at the time the 
invention was made to modify the method of Jurka et al. (1996) by addition of newly determined 
repeat sequences to a repeat sequence database so that the repeat sequence database would be a 
more comprehensive listing of repeat sequences.

15. Claims 2, 6, 15, 16, 19-24, 26-29, and 31-33 are rejected under 35 U.S.C. 103(a) as being
unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, 
and further in view of Altschul et al. 

The claims are drawn to the method of claim 2 further limited to analysis of 
ribonucleotide sequences, sequences that encode amino acid sequences, synthetic DNA such as 
cDNA, repeat sequence databases accessible through the internet, use of public domain databases 
GenBank, dbEST, and SwissProt, use of search algorithms BLAST and FASTA, and use of
scoring matrices PAM and BLOSUM. 

Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above does not
show the method of claim 2 further limited to analysis of ribonucleotide sequences, sequences
that encode amino acid sequences, repeat sequence databases accessible through the internet, use
of public domain databases GenBank, dbEST, and SwissProt, use of search algorithms BLAST
and FASTA, and use of scoring matrices PAM and BLOSUM.

Altschul et al. reviews searching sequence databases. Altschul et al. shows searching 
query sequences derived from mRNA such as cDNA that encode proteins on page 119 and 
figures 2 and 3. Altschul et al. shows repeat sequence databases accessible through the internet
used to mask query sequences on page 128. Altschul et al. shows public domain databases
GenBank on page 124, SwissProt on page 127, and dbEST on page 128 (reference 60). Altschul 
et al. shows use of BLAST and FASTA search algorithms on page 120 and use of scoring 
matrices PAM and BLOSUM on pages 123-124. 

It would have been obvious to a person of ordinary skill in the art at the time the 
invention was made to modify the method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8,
18-20, 27, and 30 above by use of analysis of ribonucleotide sequences, sequences that encode
amino acid sequences, repeat sequence databases accessible through the internet, use of public 
domain databases GenBank, dbEST, and SwissProt, use of search algorithms BLAST and 
FASTA, and use of scoring matrices PAM and BLOSUM because Altschul et al. shows use of 
all of those features in the context of searching sequence databases with query sequences whose 
repeat sequences have been masked. 

16. Claims 2 and 7-14 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jurka
et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, and further in view of
Jurka (1998). 

The claims are drawn to the method of claim 2 utilizing sequences from mice, plants, 
fungi, and microorganisms.

Jurka (1998) reviews repeat sequences from a variety of organisms. Jurka (1998) points 
to mouse repeat sequences on page 334 and table 1.

It would have been obvious to a person of ordinary skill in the art at the time the 
invention was made to modify the method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8,
18-20, 27, and 30 above by use of repeat sequences from a variety of organisms so that 
corresponding query sequences from the organisms could be analyzed and masked. 

17. Claims 2, 22, and 25 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, and further in view
of Sohocki et al. 

The claims are drawn to the method of claim 2 further limited to use of a TIGR database.

Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above does not
show use of a TIGR database. 

Sohocki et al. shows in the abstract and throughout use of the TIGR Human Gene Index 
database to search for genes for inherited retinal disorders.

It would have been obvious to a person of ordinary skill in the art at the time the 
invention was made to modify the method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 
18-20, 27, and 30 above by use of the TIGR Human Gene Index database because Sohocki et al.
shows that the database is a useful source of human genes such as genes related to inherited 
retinal disorders. 

Conclusion 

18. Any inquiry of a general nature or relating to the status of this application or proceeding 
should be directed to (571) 272-0547. 

19. Patent applicants with problems or questions regarding electronic images that can be
viewed in the Patent Application Information Retrieval system (PAIR) can now contact the
USPTO's Patent Electronic Business Center (Patent EBC) for assistance. Representatives are
available to answer your questions daily from 6 am to midnight (EST). The toll free number is
(866) 217-9197. When calling please have your application serial or patent number, the type of
document you are having an image problem with, the number of pages and the specific nature of
the problem. The Patent Electronic Business Center will notify applicants of the resolution of
the problem within 5-7 business days. Applicants can also check PAIR to confirm that the
problem has been corrected. The USPTO's Patent Electronic Business Center is a complete 
service center supporting all patent business on the Internet. The USPTO's PAIR system 
provides Internet-based access to patent application status and history information. It also
enables applicants to view the scanned images of their own application file folder(s) as well as
general patent information available to the public. 

For all other customer support, please call the USPTO Call Center (UCC) at 800-786-9199.

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to John S. Brusca whose telephone number is (571) 272-0714. The
examiner can normally be reached on M-F 8:30-5:00. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Michael Woodward can be reached on (571) 272-0722. The fax phone number for 
the organization where this application or proceeding is assigned is 703-872-9306.

Information regarding the status of an application may be obtained from the Patent
Application Information Retrieval (PAIR) system. Status information for published applications
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished
applications is available through Private PAIR only. For more information about the PAIR
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 




John S. Brusca 
Primary Examiner 
Art Unit 1631 